Lumeng Wu 09 May, 2025
Code: https://github.com/dirtyDan0/VerboseLengthReduction
Experiment logs: https://api.wandb.ai/links/dirtydan0/4qe871y8
<aside> 💥
TL;DR: Our results show that, even after RL training metrics plateau, LLMs keep improving: extreme over-generation continues to decrease.
</aside>
Recent studies [1] [2] have pointed out that the performance of reinforcement learning (RL) on large language models (LLMs) is constrained by the inherent ability of the base model. Consequently, when the model size is limited and the training data is insufficient, training tends to plateau early.
In line with several recent works [3] [4] [5], we perform RL directly on the base model. Using Qwen2.5-3B, we train for one epoch (93 steps per run) on MATH12k and evaluate on MATH500, experimenting with configurations that combine different reward functions and advantage estimators (REINFORCE, GRPO, RLOO). Across all configurations, we observe an early training plateau: metrics such as average score, test score, and pass@4 stop improving before step 30. This initially suggested that training had saturated, with little room for further gains. Surprisingly, however, we find that the model keeps changing well past the plateau: extreme over-generation continues to decrease throughout training.
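For concreteness, here is a minimal sketch of the group-based advantage estimators compared in these runs, operating on a rewards tensor of shape (num_questions, group_size); plain REINFORCE simply uses the (optionally baselined) reward itself. This illustrates the standard formulas only; details such as the normalization epsilon are assumptions, and verl's implementation may differ.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """GRPO: standardize each reward within its group of sampled responses.

    rewards: shape (num_questions, group_size); here group_size = 4.
    """
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

def rloo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """RLOO: baseline each sample with the mean reward of the other samples
    for the same question (leave-one-out)."""
    k = rewards.size(-1)
    loo_mean = (rewards.sum(dim=-1, keepdim=True) - rewards) / (k - 1)
    return rewards - loo_mean
```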
<aside> 📢
We use verl for training.
All experiments are conducted on 2×H20 GPUs, with each run taking less than 2.5 hours for 1 epoch (93 steps). The max response length is 1500 tokens. For each question, we sample 4 answers, which enables the computation of pass@4 and supports GRPO with a group size of 4. We adopt token-mean loss aggregation, which normalizes each token's loss by the number of valid tokens in the micro batch.
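As a concrete illustration of the token-mean aggregation just mentioned (a sketch of the general idea rather than verl's exact code; the tensor names are assumptions):

```python
import torch

def token_mean_loss(per_token_loss: torch.Tensor, response_mask: torch.Tensor) -> torch.Tensor:
    """token-mean: sum the per-token loss over all valid (non-padding) response
    tokens in the micro batch, then divide by the number of valid tokens.

    per_token_loss, response_mask: shape (batch, seq_len); the mask is 1 for
    valid response tokens and 0 elsewhere.
    """
    valid_tokens = response_mask.sum().clamp(min=1)  # avoid division by zero
    return (per_token_loss * response_mask).sum() / valid_tokens
```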
We prompt the LLM to put its answer in \\boxed{}, allowing every answer to be categorized into one of the following three results:
Result | Meaning | Default Score
--- | --- | ---
wrong | no \\boxed{} detected | 0
format-only | \\boxed{} present, but the answer is incorrect | 0
correct | \\boxed{} present and the answer is correct | 1

Unless otherwise specified, the default score is 1 for correct and 0 for all other cases.

</aside>
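The following sketch instantiates this scoring scheme; the helper names and the plain string comparison are assumptions, and the actual repo may use a more robust math-equivalence check:

```python
import re

def extract_boxed(response: str) -> str | None:
    """Return the content of the last \\boxed{...} in the response, or None.

    Note: this simple regex does not handle nested braces inside the box.
    """
    matches = re.findall(r"\\boxed\{([^{}]*)\}", response)
    return matches[-1] if matches else None

def default_score(response: str, ground_truth: str) -> float:
    """Default scores: wrong -> 0, format-only -> 0, correct -> 1."""
    answer = extract_boxed(response)
    if answer is None:                          # wrong: no \boxed{} detected
        return 0.0
    if answer.strip() != ground_truth.strip():  # format-only: boxed but incorrect
        return 0.0
    return 1.0                                  # correct
```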
We conducted our initial experiments with REINFORCE, GRPO, and RLOO. However, all three methods saturated around step 30 and exhibited fluctuations throughout the remainder of training.
Figure 1 | Results of the initial experiments. All runs plateaued before step 30.
Since these metrics indicated that training had already saturated, we began to investigate whether other aspects of the model’s behavior continued to evolve during the later stages of training. This prompted a closer analysis of the model’s output responses. We observed that some responses exhibited over-generation—that is, the model produced excessive and often irrelevant content beyond what was necessary. To explore whether this over-generation behavior also saturated early, we devised a method to quantify its extent and performed a corresponding analysis.
A natural way to measure over-generation is to examine the content that appears after the answer tag \\boxed{answer}, since we observed that the model sometimes continued producing irrelevant content after giving the answer.
<aside> ℹ️
Therefore, we define verbose length as the number of tokens following \\boxed{answer}, in order to quantify the extent of verbosity. Note that verbose length can only be calculated for correct and format-only responses, as it relies on parsing \\boxed{answer}.
</aside>
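Below is a minimal sketch of how verbose length could be computed, assuming a Hugging Face-style tokenizer and a simple regex for \\boxed{answer}; the repo's implementation may instead count directly over the generated token ids:

```python
import re

def verbose_length(response: str, tokenizer) -> int | None:
    """Approximate verbose length: number of tokens after \\boxed{answer}.

    Returns None if no \\boxed{} is found (a 'wrong' response), since the
    measure is only defined for correct and format-only responses.
    """
    matches = list(re.finditer(r"\\boxed\{[^{}]*\}", response))
    if not matches:
        return None
    tail = response[matches[-1].end():]
    # Re-tokenizing the decoded tail approximates the number of tokens the
    # model generated after the boxed answer.
    return len(tokenizer(tail, add_special_tokens=False)["input_ids"])
```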