Commit Graph

15 Commits

Author SHA1 Message Date
YeAnbang 03b41d6fb5 upgrade reward functions 2025-05-16 18:04:38 +08:00
YeAnbang 47a7dc7142 Support evaluation during training 2025-05-14 11:03:11 +08:00
YeAnbang eb6b5dd62e [fix] revert reward update and evaluation (#6295)
* Revert "rewrite reward fn"

This reverts commit d06042b434.

* Revert "upgrade reward math verification"

This reverts commit a6085ff676.

* Revert "fix bug"

This reverts commit 01640ebd65.

* Revert "reuse comm-group"

This reverts commit bd61918dcf.

* Revert "Support evaluation during training"

This reverts commit 57a88395fe.
2025-05-07 10:56:47 +08:00
YeAnbang d06042b434 rewrite reward fn 2025-05-01 11:28:05 +08:00
YeAnbang a6085ff676 upgrade reward math verification 2025-04-30 22:59:54 +08:00
YeAnbang 57a88395fe Support evaluation during training 2025-04-30 18:31:49 +08:00
YeAnbang 14f237ce7e [feat] Support boxed math reward (#6284)
* fix pp+tp, fix dataloader

* fixed plugin micro-batch size

* support boxed reward

* add boxed reward

* fix pp state dict incomplete issue

* Revert "fix pp state dict incomplete issue"

This reverts commit 6c1b3b694f.
2025-04-29 16:46:47 +08:00
YeAnbang 26d859f68e [feat] Support DAPO (#6263)
* update help information

* update style

* fix

* minor fix

* support PP training

* add pp support

* remove unused code

* address conversation

* fix memory leakage support tp+pp

* move empty cache

* move empty cache

* add DAPO support

* remove format reward

* fix filtering, still buggy

* small fix

* add DAPO support

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* tested multi-node training; fix bind_batch bug

* fix conversation; support sleep mode

* support reusing excessive samples

* add dynamic batching control flag

* add dynamic batching control flag

* refactored

* fix logging

---------

Co-authored-by: Tong Li <tong.li35271158@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2025-04-25 17:39:17 +08:00
Tong Li abca66e69f fix reward score 2025-03-11 10:17:32 +08:00
Tong Li 71a0181fce update reward 2025-03-10 14:19:10 +08:00
Tong Li 754b16dfbf update reward fn 2025-03-10 14:18:22 +08:00
Tong Li d03cdea949 update reward fn 2025-03-06 10:53:48 +08:00
Tong Li 070907dd7f polish 2025-02-28 10:16:42 +08:00
Tong Li ffd3878a1e add simple grpo 2025-02-23 22:54:26 +08:00
Tong Li 8e6c9a4ab3 add reward related function 2025-02-23 11:02:54 +08:00