Commit Graph

13 Commits

Author SHA1 Message Date
Tong Li
4ac7d065a6 update pad seq (#6303)
Co-authored-by: Tong Li <tong.li35271158@gmail.com>
2025-08-05 13:59:03 +08:00
YeAnbang
9544c51a74 [fix] revert reward update and evaluation (#6295)
* Revert "rewrite reward fn"

This reverts commit d06042b434.

* Revert "upgrade reward math verification"

This reverts commit a6085ff676.

* Revert "fix bug"

This reverts commit 01640ebd65.

* Revert "reuse comm-group"

This reverts commit bd61918dcf.

* Revert "Support evaluation during training"

This reverts commit 57a88395fe.
2025-08-05 13:59:02 +08:00
YeAnbang
16600f3509 Support evaluation during training 2025-08-05 13:59:02 +08:00
YeAnbang
5f913e8b77 [feat] Support DAPO (#6263)
* update help information

* update style

* fix

* minor fix

* support PP training

* add pp support

* remove unused code

* address conversation

* fix memory leakage support tp+pp

* move empty cache

* move empty cache

* add DAPO support

* remove format reward

* fix filtering, still buggy

* small fix

* add DAPO support

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* tested multi-node training; fix bind_batch bug

* fix conversation; support sleep mode

* support reusing excessive samples

* add dynamic batching control flag

* add dynamic batching control flag

* refactored

* fix logging

---------

Co-authored-by: Tong Li <tong.li35271158@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2025-08-05 13:59:02 +08:00
YeAnbang
23aac43dcf simplify vllm preprocessing input ids 2025-08-05 13:59:02 +08:00
YeAnbang
16e68a071d fix logprob, add filtering, temperature annealing, lr descent 2025-08-05 13:59:02 +08:00
YeAnbang
f983071b10 fix vllm 2025-08-05 13:59:02 +08:00
YeAnbang
35dabd718e fix transformers backend 2025-08-05 13:59:02 +08:00
Tong Li
30c7ddd9f1 convert to 8 generation 2025-08-05 13:59:02 +08:00
Tong Li
718c4b76cc polish 2025-08-05 13:59:01 +08:00
Tong Li
40d601802d add simple grpo 2025-08-05 13:59:01 +08:00
Hongxin Liu
7a2d455136 [feature] fit RL style generation (#6213)
* [feature] fit rl style generation

* [doc] add docstr

* [doc] add docstr
2025-08-05 13:59:01 +08:00
Hongxin Liu
162bb42321 [chat] add distributed impl (#6210) 2025-08-05 13:59:01 +08:00