YeAnbang
9cbc5dd924
upgrade reward functions
2025-08-05 14:01:20 +08:00
YeAnbang
6095274be6
support logging rollouts to wandb
2025-08-05 14:01:20 +08:00
YeAnbang
654aefc3c3
address conversation
2025-08-05 14:01:18 +08:00
YeAnbang
e7f61be51a
fix evaluation
2025-08-05 14:00:44 +08:00
Tong Li
6ebd813b5f
handle empty index
2025-08-05 14:00:43 +08:00
YeAnbang
88f49ddc5e
remove redundant code and fix bugs
2025-08-05 13:59:56 +08:00
YeAnbang
d19f1f21b6
move prompt-level-filtering to buffer side
2025-08-05 13:59:56 +08:00
YeAnbang
f79dbdb2df
move prompt-level-filtering to buffer side
2025-08-05 13:59:56 +08:00
YeAnbang
0d0fef771f
disable wandb tb syncing
2025-08-05 13:59:56 +08:00
YeAnbang
280aa0b830
use consumer global step
2025-08-05 13:59:56 +08:00
Tong Li
5a6e4a6d75
[feat] Support prompt level dynamic (#6300)
* adjust to dynamic prompt bs
* remove debug
* update pad seq (#6303)
Co-authored-by: Tong Li <tong.li35271158@gmail.com>
* adjust to dynamic prompt bs
* remove debug
* fix dp issue
* fix
* fix default settings
---------
Co-authored-by: Tong Li <tong.li35271158@gmail.com>
2025-08-05 13:59:53 +08:00
YeAnbang
3416a4fc9c
move logging to producer
2025-08-05 13:59:03 +08:00
YeAnbang
af4366f0cb
Support evaluation during training
2025-08-05 13:59:03 +08:00
Tong Li
4ac7d065a6
update pad seq (#6303)
Co-authored-by: Tong Li <tong.li35271158@gmail.com>
2025-08-05 13:59:03 +08:00
YeAnbang
9544c51a74
[fix] revert reward update and evaluation (#6295)
* Revert "rewrite reward fn"
This reverts commit d06042b434.
* Revert "upgrade reward math verification"
This reverts commit a6085ff676.
* Revert "fix bug"
This reverts commit 01640ebd65.
* Revert "reuse comm-group"
This reverts commit bd61918dcf.
* Revert "Support evaluation during training"
This reverts commit 57a88395fe.
2025-08-05 13:59:02 +08:00
YeAnbang
06b892bf4d
rewrite reward fn
2025-08-05 13:59:02 +08:00
YeAnbang
9642b75581
upgrade reward math verification
2025-08-05 13:59:02 +08:00
YeAnbang
1be993de3e
fix bug
2025-08-05 13:59:02 +08:00
YeAnbang
de0c267f5a
reuse comm-group
2025-08-05 13:59:02 +08:00
YeAnbang
16600f3509
Support evaluation during training
2025-08-05 13:59:02 +08:00
Tong Li
6a1bd833e0
[feat] Sync shard model (#6289)
* [feat] support hybrid parallel model sync
* update consumer and producer
* update files
* update producer
* remove print
* update
---------
Co-authored-by: duanjunwen <935724073@qq.com>
Co-authored-by: YeAnbang <44796419+YeAnbang@users.noreply.github.com>
Co-authored-by: Tong Li <tong.li35271158@gmail.com>
2025-08-05 13:59:02 +08:00
YeAnbang
e181318d51
[feat] Support boxed math reward (#6284)
* fix pp+tp, fix dataloader
* fixed plugin micro-batch size
* support boxed reward
* add boxed reward
* fix pp state dict incomplete issue
* Revert "fix pp state dict incomplete issue"
This reverts commit 6c1b3b694f.
2025-08-05 13:59:02 +08:00
YeAnbang
fb4e507d00
fix pp+tp, fix dataloader (#6280)
2025-08-05 13:59:02 +08:00
Tong Li
37a8be7651
fix save issue (#6279)
Co-authored-by: Tong Li <tong.li35271158@gmail.com>
2025-08-05 13:59:02 +08:00
YeAnbang
673682e716
fix checkpoint naming; add num_epoch parameter (#6277)
2025-08-05 13:59:02 +08:00
YeAnbang
5f913e8b77
[feat] Support DAPO (#6263)
* update help information
* update style
* fix
* minor fix
* support PP training
* add pp support
* remove unused code
* address conversation
* fix memory leakage support tp+pp
* move empty cache
* move empty cache
* add DAPO support
* remove format reward
* fix filtering, still buggy
* small fix
* add DAPO support
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* tested multi-node training; fix bind_batch bug
* fix conversation; support sleep mode
* support reusing excessive samples
* add dynamic batching control flag
* add dynamic batching control flag
* refactored
* fix logging
---------
Co-authored-by: Tong Li <tong.li35271158@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2025-08-05 13:59:02 +08:00
Tong Li
b34d707cdc
[feat] Add final save at the end (#6274)
* add final save
* default 1 episode
2025-08-05 13:59:02 +08:00
Tong Li
befd4f1487
add prompt template (#6273)
Co-authored-by: Tong Li <tong.li35271158@gmail.com>
2025-08-05 13:59:02 +08:00
YeAnbang
3bd6fa3c67
[hot-fix] Fix memory leakage bug, support TP+PP (#6258)
* update help information
* update style
* fix
* minor fix
* support PP training
* add pp support
* remove unused code
* address conversation
* fix memory leakage support tp+pp
* move empty cache
* move empty cache
---------
Co-authored-by: Tong Li <tong.li35271158@gmail.com>
2025-08-05 13:59:02 +08:00
YeAnbang
5d79b9e692
[Distributed RLHF] Integration of PP (#6257)
* update help information
* update style
* fix
* minor fix
* support PP training
* add pp support
* remove unused code
* address conversation
---------
Co-authored-by: Tong Li <tong.li35271158@gmail.com>
2025-08-05 13:59:02 +08:00
YeAnbang
12da4d14aa
[feat] add microbatch forwarding (#6251)
* add microbatch forwarding
* fix forward microbatch
* fix producer OOM
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* change project name
* fix temperature annealing
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* address conversation
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2025-08-05 13:59:02 +08:00
YeAnbang
c627b60551
update logging
2025-08-05 13:59:02 +08:00
YeAnbang
23aac43dcf
simplify vllm preprocessing input ids
2025-08-05 13:59:02 +08:00
YeAnbang
16e68a071d
fix logprob, add filtering, temperature annealing, lr descent
2025-08-05 13:59:02 +08:00
YeAnbang
f983071b10
fix vllm
2025-08-05 13:59:02 +08:00
duanjunwen
455185345e
[Feature] Support Distributed LogProb for GRPO Training (#6247)
* [fix] fix qwen VocabParallelLMHead1D and gather output
* fix tp bug
* fix consumer
* [feat] Support Distributed LogProb for GRPO Training
* [fix] fix loss func
* [fix] fix log prob plugin
* [fix] fix qwen modeling param
* [fix] rm comments
* [fix] rm hard-code;fix non-dist version
* [fix] fix test file param name and benchmark tp gather output=True/False
* [fix] rm non-dist version in dist log prob
* [fix] fix comments
* [fix] fix dis log prob plugin
* [fix] fix test case
* [fix] fix qwen VocabParallelLMHead1D and gather output
* [fix] fix DistLogProb comments
* [fix] restore tp size
* [fix] fix comments
* [fix] fix comment; fix LogSoftmax usage
---------
Co-authored-by: Tong Li <tong.li35271158@gmail.com>
2025-08-05 13:59:02 +08:00
YeAnbang
35dabd718e
fix transformers backend
2025-08-05 13:59:02 +08:00
Tong Li
e224673c44
setup update
2025-08-05 13:59:02 +08:00
Tong Li
bfc45829c3
print results
2025-08-05 13:59:02 +08:00
Tong Li
30c7ddd9f1
convert to 8 generation
2025-08-05 13:59:02 +08:00
Tong Li
a2ae82a417
fix consumer
2025-08-05 13:59:02 +08:00
Tong Li
b19355f8f0
fix tp bug
2025-08-05 13:59:02 +08:00
Tong Li
69a1a325ee
detach
2025-08-05 13:59:02 +08:00
Tong Li
b951d0b224
add response length
2025-08-05 13:59:02 +08:00
Tong Li
a4862a2349
fix reward score
2025-08-05 13:59:02 +08:00
Tong Li
a537aa1c20
update reward
2025-08-05 13:59:02 +08:00
Tong Li
c8db826782
update reward fn
2025-08-05 13:59:02 +08:00
Tong Li
fe017d34c5
update grpo
2025-08-05 13:59:02 +08:00
pre-commit-ci[bot]
bc538ba049
[pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
2025-08-05 13:59:02 +08:00
pre-commit-ci[bot]
f71d422690
[pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
2025-08-05 13:59:01 +08:00