Lumeng Wu 09 May, 2025

Code: https://github.com/dirtyDan0/VerboseLengthReduction

Experiment logs: https://api.wandb.ai/links/dirtydan0/4qe871y8

<aside> 💥

TL;DR: Our results show that, even after standard RL training metrics plateau, the LLM keeps improving the quality of its responses: extreme over-generation becomes steadily rarer.

</aside>

Recent studies [1] [2] have pointed out that the performance of reinforcement learning (RL) on large language models (LLMs) is constrained by the inherent ability of the base model. Consequently, when the model size is limited and the training data is insufficient, training tends to plateau early.

In line with several recent works [3] [4] [5], we perform reinforcement learning (RL) directly on the base model. Using Qwen2.5-3B, we train for one epoch (93 steps per run) on MATH12k and evaluate on MATH500, experimenting with various configurations involving different reward functions and advantage estimators (REINFORCE, GRPO, RLOO). Across all configurations, we observe an early training plateau: metrics such as average score, test score, and pass@4 cease improving before step 30. This observation initially suggested that training had saturated, with little room for further gains. Surprisingly, however, we find that:

<aside> 📢

  1. Although general performance metrics plateau, the LLM continues to improve the quality of its responses by reducing unnecessary verbosity after producing the final answer. This reduction occurs spontaneously without any explicit intervention.
  2. The major source of verbosity comes from a small proportion of extremely verbose responses, which become increasingly rare as training progresses.
  3. One contributing factor to such verbosity is the inclusion of code content: responses that contain code tend to be more verbose. Interestingly, the presence of code in responses also diminishes during training, further contributing to the overall reduction in verbosity.

</aside>

Basic Settings and Runs

We use verl for training.

All experiments are conducted on 2×H20 GPUs, with each run taking less than 2.5 hours for one epoch (93 steps). The maximum response length is 1500 tokens. For each question, we sample 4 answers, both to enable the computation of pass@4 and to support GRPO with a group size of 4. We adopt token-mean loss aggregation, which normalizes the loss of each token by the number of valid tokens in a micro batch.
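
As a minimal sketch of these two pieces (the function names and tensor layout here are ours, not verl's API):

```python
import torch

def token_mean_loss(per_token_loss: torch.Tensor, response_mask: torch.Tensor) -> torch.Tensor:
    # per_token_loss: (batch, seq_len) policy-gradient loss for each response token
    # response_mask:  (batch, seq_len) 1 for valid response tokens, 0 for padding
    # token-mean sums the loss over every valid token in the micro batch and
    # divides by the total number of valid tokens, so longer responses
    # contribute proportionally more tokens to the update.
    return (per_token_loss * response_mask).sum() / response_mask.sum().clamp(min=1)

def pass_at_4(correct: list[list[bool]]) -> float:
    # correct[i] holds the correctness of the 4 answers sampled for question i;
    # pass@4 is the fraction of questions with at least one correct sample.
    return sum(any(samples) for samples in correct) / len(correct)
```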

We prompt the LLM to put its answer in \boxed{}, so every response can be categorized into one of the following three results:

| Result | Meaning | Default Score |
| --- | --- | --- |
| wrong | no \boxed{} detected | 0 |
| format-only | \boxed{} present, but the answer is incorrect | 0 |
| correct | \boxed{} present and the answer is correct | 1 |

Unless otherwise specified, the default score is 1 for correct and 0 for all other cases.
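
As a rough sketch, this default scoring can be implemented as below (the function name is ours, and the exact string match is a simplification; in practice, math answers are checked for mathematical equivalence rather than literal equality):

```python
import re

def score_response(response: str, ground_truth: str) -> float:
    # Extract the content of \boxed{...}; nested braces are ignored for simplicity.
    match = re.search(r"\\boxed\{([^{}]*)\}", response)
    if match is None:
        return 0.0  # wrong: no \boxed{} detected
    answer = match.group(1).strip()
    # format-only responses (boxed but incorrect) also receive 0 by default
    return 1.0 if answer == ground_truth else 0.0
```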

We conducted our initial experiments with REINFORCE, GRPO, and RLOO. However, all three methods saturated around step 30 and exhibited fluctuations throughout the remainder of training.

Figure 1 | Results of the initial experiments. All runs plateaued before step 30.

Since these metrics indicated that training had already saturated, we began to investigate whether other aspects of the model’s behavior continued to evolve during the later stages of training. This prompted a closer analysis of the model’s responses. We observed that some responses exhibited over-generation: the model produced excessive and often irrelevant content beyond what was necessary. To explore whether this behavior also saturated early, we devised a method to quantify its extent and performed a corresponding analysis.

Method

Definition of Verbose Length

A natural way to measure over-generation is to examine the length of the content after the answer tag \boxed{answer}, since we observed that the model sometimes continued generating irrelevant content after the answer.

<aside> ℹ️

Therefore, we define verbose length as the number of tokens following \boxed{answer}, in order to quantify the extent of verbosity.

Note that verbose length can only be calculated for correct and format-only responses, as it relies on parsing \boxed{answer}.

</aside>
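
A sketch of this computation, assuming a Hugging Face-style tokenizer and taking the last \boxed{...} in a response as the final answer (both assumptions are ours):

```python
import re

def verbose_length(response: str, tokenizer) -> int | None:
    # Returns None for "wrong" responses, where no \boxed{...} can be parsed.
    matches = list(re.finditer(r"\\boxed\{[^{}]*\}", response))
    if not matches:
        return None
    tail = response[matches[-1].end():]  # content after \boxed{answer}
    return len(tokenizer.encode(tail, add_special_tokens=False))
```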