Reinforcement Learning in LLM

Transformer
Large Language Model
Reinforcement Learning
In this blog, we will explore how reinforcement learning is applied in large language models (LLMs) to enhance their performance and adaptability. We will start by mapping LLMs onto the RL framework, then dive into the RL algorithms used for RLHF, such as PPO and DPO. Then we will explore the RLVR (Reinforcement Learning with Verifiable Rewards) setting and algorithms such as GRPO, GSPO, DAPO, SAPO, and the state-of-the-art GDPO. Next we will discuss current LLM-RL training frameworks such as Verl. Finally, we will discuss the challenges and future directions of applying RL in LLMs.
Author

Yuyang Zhang

Published

2026-01-25

Modified

2026-02-02

1 Preliminaries

In this section, we will review some fundamental concepts and notations that are essential for understanding reinforcement learning (RL) and its application in large language models (LLMs). We will cover key topics such as entropy, KL-Divergence, Monte Carlo estimation, variance reduction techniques, and importance sampling. These concepts will provide a solid foundation for the subsequent discussions on RL algorithms and their integration with LLMs. For readers who are already familiar with these topics, feel free to skip this section and proceed to the next part of the blog.

1.1 Entropy

Entropy is a measure of uncertainty or randomness in a probability distribution. In the context of reinforcement learning, entropy is often used to encourage exploration by promoting diverse action selection. The entropy of a discrete probability distribution \(p(x)\) is defined as:

\[ H(p) = - \sum_{x} p(x) \log p(x) \tag{1}\]

For continuous distributions, the entropy is defined as: \[ H(p) = - \int p(x) \log p(x) dx \tag{2}\]

One thing to note is that:

  • Higher entropy indicates more uncertainty and exploration
  • Lower entropy indicates more certainty and exploitation

For the discrete case, the entropy is maximized when the distribution is uniform \(p(x) = \frac{1}{|X|}\), i.e., all outcomes are equally likely. In contrast, the entropy is minimized (zero) when the distribution is deterministic, i.e., one outcome has probability 1 and all others have probability 0.
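A quick pure-Python check of both extremes (the `entropy` helper is mine):

```python
import math

def entropy(p):
    """Shannon entropy of a discrete distribution, in nats.
    Terms with p(x) = 0 contribute 0 by convention."""
    return -sum(px * math.log(px) for px in p if px > 0)

uniform = [0.25, 0.25, 0.25, 0.25]    # maximum uncertainty
deterministic = [1.0, 0.0, 0.0, 0.0]  # no uncertainty

print(entropy(uniform))        # log(4), the maximum for |X| = 4
print(entropy(deterministic))  # 0.0
```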

1.2 KL-Divergence

The Kullback-Leibler (KL) Divergence is a measure of how one probability distribution diverges from a second, expected probability distribution. In reinforcement learning, KL-Divergence is often used to quantify the difference between two policies, which is useful for policy optimization and regularization. The KL-Divergence from distribution \(P\) to distribution \(Q\) is defined as: \[ D_{KL}(P || Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)} \tag{3}\]

For continuous distributions, the KL-Divergence is defined as: \[ D_{KL}(P || Q) = \int P(x) \log \frac{P(x)}{Q(x)} dx \tag{4}\]

The KL-Divergence has several important properties:

  • Non-negativity: \(D_{KL}(P || Q) \geq 0\), with equality if and only if \(P = Q\) almost everywhere.
  • Asymmetry: \(D_{KL}(P || Q) \neq D_{KL}(Q || P)\) in general.
import torch

def compute_kl_divergence(P: torch.Tensor, Q: torch.Tensor) -> torch.Tensor:
    """Exact KL(P || Q) for discrete distributions given as probability vectors.

    Assumes Q(x) > 0 wherever P(x) > 0; entries with P(x) = 0 contribute
    zero to the sum by convention, so we mask them out for numerical safety.
    """
    mask = P > 0
    return torch.sum(P[mask] * torch.log(P[mask] / Q[mask]))

NOTE Further Reading about Entropy & KL Divergence

For those who want to dive deeper into Entropy and KL-Divergence, I highly recommend checking out my previous blog: From Entropy to KL Divergence: A Comprehensive Guide

1.3 Monte Carlo Estimation

In the practice of machine learning and reinforcement learning, we often encounter situations where we need to compute expectations of functions with respect to complex probability distributions. Monte Carlo estimation is a powerful technique that allows us to approximate these expectations using random sampling. The basic idea is to draw samples from the target distribution and use them to compute an empirical average of the function of interest. Due to the law of large numbers, this empirical average converges to the true expectation as the number of samples increases.

For example, suppose we want to estimate the expectation of a function \(f(x)\) with respect to a probability distribution \(p(x)\): \[ \begin{split} \mathbb{E}_{x \sim p(x)}[f(x)] & = \int f(x) p(x) dx \\ & \approx \frac{1}{N} \sum_{i=1}^{N} f(x^{(i)}) \quad \text{where } x^{(i)} \sim p(x) \end{split} \tag{5}\]

This is the foundation of many algorithms in reinforcement learning, where we need to estimate expected returns, value functions, and policy gradients based on sampled trajectories.

For the KL-Divergence between two distributions \(P\) and \(Q\), we can use Monte Carlo estimation to approximate it as follows: \[ D_{KL}(P || Q) = \mathbb{E}_{x \sim P} \left[ \log \frac{P(x)}{Q(x)} \right] \approx \frac{1}{N} \sum_{i=1}^{N} \log \frac{P(x^{(i)})}{Q(x^{(i)})} \quad \text{where } x^{(i)} \sim P(x) \tag{6}\]
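Equation 6 in code: a pure-Python sketch (all names mine) that compares the Monte Carlo estimate against the exact discrete KL:

```python
import math
import random

random.seed(0)

# Two discrete distributions over the same support {0, 1, 2}.
P = [0.5, 0.3, 0.2]
Q = [0.2, 0.3, 0.5]

# Exact KL(P || Q), as in Equation 3.
exact_kl = sum(p * math.log(p / q) for p, q in zip(P, Q))

# Monte Carlo estimate: sample x ~ P, average log P(x)/Q(x).
N = 200_000
samples = random.choices(range(3), weights=P, k=N)
mc_kl = sum(math.log(P[x] / Q[x]) for x in samples) / N

print(exact_kl, mc_kl)  # the two values should be close for large N
```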

1.4 Variance Reduction Tricks

One thing to note about Monte Carlo estimation is that it can suffer from high variance, especially when the function being estimated has high variability or when the number of samples is limited. To mitigate this issue, various variance reduction techniques can be employed, such as:

  • Control Variates: This technique involves introducing a control variate, which is a function with a known expectation that is correlated with the function of interest. By adjusting the estimate using the control variate, we can reduce the variance of the estimator.
  • Antithetic Variates: This method involves generating pairs of negatively correlated samples to reduce variance. By averaging the results from these pairs, we can achieve a more stable estimate.

We can also subtract a baseline from the function being estimated to reduce variance without introducing bias. For example, when estimating the expected return in reinforcement learning, we can subtract a baseline value \(b\) from the return: \[ \mathbb{E}_{x \sim p(x)}[f(x)] = \mathbb{E}_{x \sim p(x)}[f(x) - b] + b \tag{7}\]

Where \(b\) is a constant or a function that does not depend on \(x\). This adjustment helps to center the estimates around the baseline, reducing variance while maintaining the unbiasedness of the estimator.

1.5 Importance Sampling

When we want to estimate an expectation with respect to a target distribution \(p(x)\), but we can only sample from a different distribution \(q(x)\) (known as the proposal distribution), we can use importance sampling to correct for the discrepancy between the two distributions. The key idea is to reweight the samples drawn from the proposal distribution by the ratio of the target and proposal densities. The expectation of a function \(f(x)\) with respect to the target distribution \(p(x)\) can be expressed as: \[ \mathbb{E}_{x \sim p(x)}[f(x)] = \int f(x) p(x) dx = \int f(x) \frac{p(x)}{q(x)} q(x) dx = \mathbb{E}_{x \sim q(x)}\left[ f(x) \frac{p(x)}{q(x)} \right] \tag{8}\]

Using Monte Carlo estimation, we can approximate this expectation using samples drawn from the proposal distribution \(q(x)\): \[ \mathbb{E}_{x \sim p(x)}[f(x)] \approx \frac{1}{N} \sum_{i=1}^{N} f(x^{(i)}) \frac{p(x^{(i)})}{q(x^{(i)})} \quad \text{where } x^{(i)} \sim q(x) \tag{9}\]

where \(\frac{p(x^{(i)})}{q(x^{(i)})}\) is known as the importance weight / importance ratio. Importance sampling is particularly useful in reinforcement learning when we want to evaluate or improve a policy using data collected from a different policy.
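A minimal sketch of Equations 8 and 9 (the distributions, the function \(f\), and all names are toy choices of mine): we estimate \(\mathbb{E}_{x \sim p}[f(x)]\) using only samples from \(q\):

```python
import random

random.seed(0)

# Target p and proposal q over support {0, 1, 2}; f is an arbitrary function.
p = [0.7, 0.2, 0.1]
q = [1 / 3, 1 / 3, 1 / 3]
f = [1.0, 5.0, 10.0]

exact = sum(pi * fi for pi, fi in zip(p, f))  # E_p[f] = 2.7

# Sample from q, reweight each sample by the importance weight p(x)/q(x).
N = 100_000
xs = random.choices(range(3), weights=q, k=N)
estimate = sum(f[x] * p[x] / q[x] for x in xs) / N

print(exact, estimate)
```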

2 Notations

There are various notations used in the literature of reinforcement learning (RL) and large language models (LLMs). To maintain consistency and clarity throughout this blog, we will define and use a standard set of notations as follows:

Notation Meaning Description
\(\pi_{\theta}\) Policy A mapping function from states to a probability distribution over actions.
\(s_t\) State at time \(t\) The current context or input to the LLM at time step \(t\). This could include the text generated so far and any other relevant information.
\(a_t\) Action at time \(t\) The token or word generated by the LLM at time step \(t\).
\(r_t\) Reward at time \(t\) A scalar value received after taking action \(a_t\) in state \(s_t\). It indicates the quality of the generated token in the context of the overall text generation task.
\(R_t\) Return at time \(t\) The cumulative reward received from time step \(t\) onwards. It is often used to evaluate the long-term effectiveness of the policy.
\(\gamma\) Discount factor A value between 0 and 1 that determines the importance of future rewards. A higher value places more emphasis on future rewards.
\(V^{\pi}(s)\) Value function / State-value function The expected return when starting from state \(s\) and following policy \(\pi\). It measures the long-term value of being in state \(s\).
\(Q^{\pi}(s, a)\) Action-value function The expected return when starting from state \(s\), taking action \(a\), and thereafter following policy \(\pi\). It evaluates the quality of taking action \(a\) in state \(s\).
\(A^{\pi}(s, a)\) Advantage function The difference between the action-value function and the value function: \(A^{\pi}(s, a) = Q^{\pi}(s, a) - V^{\pi}(s)\). It indicates how much better or worse taking action \(a\) in state \(s\) is compared to the average action.
\(H(p)\) Entropy of distribution \(p\) A measure of uncertainty or randomness in the policy’s action selection. Higher entropy indicates more exploration.
\(\tau\) Trajectory / Episode / Rollout A sequence of states, actions, and rewards generated by following a policy from an initial state to a terminal state. Can be viewed as a sequence of state and action pairs: \(\tau = (s_0, a_0, s_1, a_1, ..., s_T, a_T)\)
\(\rho(\tau)\) Importance Sampling Ratio The ratio of probabilities of a trajectory under two different policies: \(\frac{\pi_{\theta}(\tau)}{\pi_{\theta_{old}}(\tau)}\)
Table 1: Notations in RL and LLMs

3 Review of Reinforcement Learning

In this section, we will provide a brief overview of key reinforcement learning (RL) concepts that are relevant to understanding how RL is applied in large language models (LLMs). We will cover the fundamental components of RL, including the agent, environment, states, actions, rewards, and policies. Additionally, we will discuss common RL algorithms and techniques that are often employed in the context of LLMs.

Note

For those who want to dig deeper into RL, I highly recommend the following resources:

  • Reinforcement learning: An introduction: A comprehensive textbook by Sutton and Barto that covers the fundamentals of RL.
  • Stanford CS234 Reinforcement Learning: This course provides lectures and video materials on various RL topics; it can be seen as complementary to the Sutton and Barto book.
  • UCB CS285 Deep Reinforcement Learning: This course focuses on deep reinforcement learning techniques and their applications. (The first several lectures cover basic Deep RL concepts and Policy Gradient methods, the review section in this blog is heavily inspired by these lectures)

Reinforcement Learning (RL) is a way to train an “agent” to make good decisions by trying actions, getting feedback (rewards), and learning a strategy (policy) that maximizes long-term reward.

Figure 1: The overall framework of Reinforcement Learning. The agent interacts with the environment by taking action \(a_t = \pi(s_t)\) based on its current state \(s_t\), receiving rewards \(r_t\), and transitioning to new states with probability \(p(s_{t+1} | s_t, a_t)\).

The goal of reinforcement learning is to find an optimal policy \(\pi^*\) that maximizes the expected cumulative reward (return) over time. Let’s first define the objective function:

\[ J(\pi) = \mathbb{E}_{\tau \sim \pi} \left[\sum_{t=0}^{T} r_t \right] \tag{10}\]

Where:

  • \(\tau\) represents a trajectory (sequence of states, actions) generated by following policy \(\pi\).
  • Here we use the undiscounted finite-horizon return; more generally, a discount factor \(\gamma\) weights future rewards as \(\sum_{t=0}^{T} \gamma^t r_t\).

Let’s expand the trajectory probability:

\[ p(\tau | \pi) = p(s_0) \prod_{t=0}^{T} \pi(a_t | s_t) p(s_{t+1} | s_t, a_t) \tag{11}\]

Where:

  • \(p(s_0)\) is the initial state distribution.
  • \(\pi(a_t | s_t)\) is the policy’s probability of taking action \(a_t\) in state \(s_t\).
  • \(p(s_{t+1} | s_t, a_t)\) is the environment’s transition probability from state \(s_t\) to \(s_{t+1}\) given action \(a_t\). We usually do not have access to this transition probability in model-free RL.

We know that the policy \(\pi_{\theta}\) is parameterized by \(\theta\) (e.g., neural network weights). The goal of RL is to optimize the policy parameters \(\theta\) to maximize the expected return \(J(\pi_{\theta})\). This is typically done using gradient-based optimization methods, where we compute the gradient of the objective function with respect to the policy parameters:

\[ \nabla_{\theta} J(\pi_{\theta}) = \nabla_{\theta} \mathbb{E}_{\tau \sim \pi_{\theta}} \left[\sum_{t=0}^{T} r_t \right] \tag{12}\]

However, we cannot get the gradient directly because:

  1. The expectation is over trajectories \(\tau\), which depend on the policy \(\pi_{\theta}\).
  2. The environment dynamics (transition probabilities) are usually unknown.

To address this, we can use the Log-Derivative Trick (also known as the Score Function Estimator) to rewrite the gradient:

\[ \nabla_{\theta} P_{\theta}(x) = P_{\theta}(x) \nabla_{\theta} \log P_{\theta}(x) \tag{13}\]

Applying this to our policy gradient, we have:

\[ \begin{split} \nabla_{\theta} J(\pi_{\theta}) & = \nabla_{\theta} \int \pi_{\theta}(\tau) \left( \sum_{t=0}^{T} r_t \right) d\tau = \int \pi_{\theta}(\tau) \nabla_{\theta} \log \pi_{\theta}(\tau) \left( \sum_{t=0}^{T} r_t \right) d\tau \\ & = \mathbb{E}_{\tau \sim \pi_{\theta}} \left[ \nabla_{\theta} \log \pi_{\theta}(\tau) \sum_{t=0}^{T} r_t \right] \end{split} \tag{14}\]

Where:

  • \(\log \pi_{\theta}(\tau)\) is the log-probability of the trajectory \(\tau\) under the policy \(\pi_{\theta}\).
  • \(\sum_{t=0}^{T} r_t\) is the cumulative reward for the trajectory.

Plugging in the trajectory probability from Equation 11 (the initial-state and transition terms do not depend on \(\theta\), so they vanish under the gradient), we can further expand the policy gradient:

\[ \nabla_{\theta} J(\pi_{\theta}) = \mathbb{E}_{\tau \sim \pi_{\theta}} \left[ \left( \sum_{t=0}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t | s_t) \right) \left( \sum_{t=0}^{T} r_t \right) \right] \tag{15}\]

To get this expectation, we can use Monte Carlo estimation (Section 1.3) by sampling trajectories from the current policy \(\pi_{\theta}\) and computing the empirical average:

\[ \nabla_{\theta} J(\pi_{\theta}) \approx \frac{1}{N} \sum_{i=1}^{N} \left[ \left( \sum_{t=0}^{T} \textcolor{cyan}{\nabla_{\theta} \log \pi_{\theta}(a_t^{(i)} | s_t^{(i)})} \right) \left( \sum_{t=0}^{T} r_t^{(i)} \right) \right] \tag{16}\]

WAIT!! Does this look familiar? Yes: recall from Deep Learning 101 that the classification loss gradient can be written as: \[ \begin{split} \nabla_{\theta} \mathcal{L}_{\text{classification}} & = - \mathbb{E}_{(x, y) \sim \mathcal{D}} \left[ \nabla_{\theta} \log p_{\theta}(y | x) \right] \\ & \approx - \frac{1}{N} \sum_{i=1}^{N} \textcolor{cyan}{\nabla_{\theta} \log p_{\theta}(y^{(i)} | x^{(i)})} \end{split} \tag{17}\]

Where \(p_{\theta}(y|x)\) is the model’s predicted probability for label \(y\) given input \(x\).

The similarity between Equation 16 and Equation 17 highlights that both RL policy gradients and supervised learning gradients can be computed using log-probabilities of actions/labels weighted by rewards/losses. This connection allows us to leverage techniques from supervised learning, such as stochastic gradient descent, in the context of reinforcement learning.

This forms the foundation of policy gradient methods in RL; the algorithm that optimizes the policy parameters \(\theta\) using this gradient estimate is called the REINFORCE algorithm.

However, the basic REINFORCE algorithm suffers from several issues; a major one is the high variance of the gradient estimates, which can lead to unstable and slow learning. To address this, several variance reduction techniques have been developed, such as using baselines (e.g., value functions) to reduce variance without introducing bias.

\[ \nabla_{\theta} J(\pi_{\theta}) \approx \frac{1}{N} \sum_{i=1}^{N} \left[ \left( \sum_{t=0}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t^{(i)} | s_t^{(i)}) \right) \left( \sum_{t=0}^{T} r_t^{(i)} - b(s_t^{(i)}) \right) \right] \tag{18}\]

Why this is Unbiased?

The baseline \(b(s_t)\) does not depend on the action \(a_t\), so it does not introduce bias into the gradient estimate. Conditioning on the state \(s_t\), the baseline term vanishes in expectation because:

\[ \mathbb{E}_{a_t \sim \pi_{\theta}(\cdot \mid s_t)} \left[ \nabla_{\theta} \log \pi_{\theta}(a_t | s_t) \, b(s_t) \right] = b(s_t) \int \pi_{\theta}(a_t | s_t) \nabla_{\theta} \log \pi_{\theta}(a_t | s_t) \, da_t = b(s_t) \, \nabla_{\theta} \underbrace{\int \pi_{\theta}(a_t | s_t) \, da_t}_{=1} = 0 \]

We are safe to subtract any baseline without introducing bias as long as it does not depend on the action \(a_t\). A common choice for the baseline is the value function \(V^{\pi}(s_t)\), which estimates the expected return from state \(s_t\) under policy \(\pi\). Using the value function as a baseline helps to reduce variance by centering the rewards around their expected values.

\[ \nabla_{\theta} J(\pi_{\theta}) \approx \frac{1}{N} \sum_{i=1}^{N} \left[ \left( \sum_{t=0}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t^{(i)} | s_t^{(i)}) \right) \left( \sum_{t=0}^{T} r_t^{(i)} - V^{\pi}(s_t^{(i)}) \right) \right] \tag{19}\]
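To see the variance reduction numerically, here is a self-contained toy sketch (the 2-armed bandit, its rewards, and all names are my own invention): the single-sample policy gradient estimator has the same mean with and without a baseline, but using \(b = \mathbb{E}[r]\) drastically cuts its variance.

```python
import random
import statistics

random.seed(0)

# Toy 2-armed bandit: a softmax policy over actions {0, 1} with equal
# logits, so pi(0) = pi(1) = 0.5. Rewards are fixed per action.
pi = [0.5, 0.5]
reward = [1.0, 0.0]

def grad_sample(baseline):
    """One-sample estimate of dJ/dtheta_0 = grad log pi(a) * (r(a) - b),
    where grad_{theta_0} log pi(a) = 1[a == 0] - pi(0) for a softmax policy."""
    a = random.choices([0, 1], weights=pi)[0]
    score = (1.0 if a == 0 else 0.0) - pi[0]
    return score * (reward[a] - baseline)

N = 50_000
no_base = [grad_sample(0.0) for _ in range(N)]
with_base = [grad_sample(0.5) for _ in range(N)]  # b = E[r] = 0.5

# Same mean (unbiased), much smaller variance with the baseline.
print(statistics.mean(no_base), statistics.variance(no_base))
print(statistics.mean(with_base), statistics.variance(with_base))
```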

Another technique we can use to reduce variance is applying causality: an action at time step \(t\) can only affect future rewards, not past rewards. Therefore, we can rewrite the policy gradient so that each term only considers rewards from time step \(t\) onwards:

\[ \nabla_{\theta} J(\pi_{\theta}) \approx \frac{1}{N} \sum_{i=1}^{N} \left[ \sum_{t=0}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t^{(i)} | s_t^{(i)}) \left( \sum_{t'=t}^{T} r_{t'}^{(i)} - V^{\pi}(s_t^{(i)}) \right) \right] \tag{20}\]

Here, we give \(\sum_{t'=t}^{T} r_{t'}\) a special name, the reward-to-go (also known as the return) \(R_t\):

\[ R_t = \sum_{t'=t}^{T} r_{t'} \tag{21}\]
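Computed naively for every \(t\), Equation 21 costs \(O(T^2)\); a single backward pass gives all reward-to-go values in \(O(T)\). A minimal sketch (the helper name is mine):

```python
def rewards_to_go(rewards):
    """R_t = sum_{t'=t}^{T} r_{t'}, computed with one backward pass."""
    R = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        R[t] = running
    return R

print(rewards_to_go([1.0, 0.0, 2.0, 1.0]))  # [4.0, 3.0, 3.0, 1.0]
```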

Taking a closer look at the reward-to-go \(R_t\), we can see that it is a sample estimate of the expected return if we take action \(a_t\) at state \(s_t\) and follow policy \(\pi\) thereafter. This is exactly the definition of the action-value function \(Q^{\pi}(s_t, a_t)\):

\[ Q^{\pi}(s_t, a_t) = \mathbb{E}_{\tau \sim \pi} \left[ R_t | s_t, a_t \right] \tag{22}\]

So, we can further rewrite the policy gradient as:

\[ \nabla_{\theta} J(\pi_{\theta}) \approx \frac{1}{N} \sum_{i=1}^{N} \left[ \sum_{t=0}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t^{(i)} | s_t^{(i)}) \left( Q^{\pi}(s_t^{(i)}, a_t^{(i)}) - V^{\pi}(s_t^{(i)}) \right) \right] \tag{23}\]

Where the term in the parentheses is known as the advantage function \(A^{\pi}(s_t, a_t)\): \[ A^{\pi}(s_t, a_t) = Q^{\pi}(s_t, a_t) - V^{\pi}(s_t) \tag{24}\]

The action-value function \(Q^{\pi}(s_t, a_t)\) and value function \(V^{\pi}(s_t)\) can be estimated using various methods, such as Monte Carlo estimation, Temporal Difference (TD) learning, or function approximation (e.g., neural networks). According to the Bellman equation, the value function and action-value function satisfy the following relationships:

\[ V^{\pi}(s_t) = \mathbb{E}_{a_t \sim \pi} \left[ Q^{\pi}(s_t, a_t) \right] \tag{25}\]

And \[ Q^{\pi}(s_t, a_t) = \mathbb{E}_{s_{t+1} \sim p} \left[ r_t + \gamma V^{\pi}(s_{t+1}) \right] \tag{26}\]

So, to train an RL agent using policy gradient methods, we typically follow these steps:

  1. Collect Trajectories: Sample trajectories by executing the current policy \(\pi_{\theta}\) in the environment.
  2. Estimate Returns: Compute the returns \(R_t\) for each time step in the collected trajectories.
  3. Estimate Value Functions: Estimate the value function \(V^{\pi}(s_t)\) and action-value function \(Q^{\pi}(s_t, a_t)\) using the collected data.
  4. Compute Policy Gradient: Use the estimated value functions to compute the policy gradient using Equation 23.
  5. Update Policy Parameters: Update the policy parameters \(\theta\) using gradient ascent: \[ \theta \leftarrow \theta + \alpha \nabla_{\theta} J(\pi_{\theta}) \] Where \(\alpha\) is the learning rate.

This is known as the Actor-Critic method, where the policy (the actor) is updated using value function estimates (the critic) to reduce variance and improve learning stability.

3.1 On Policy vs Off Policy

So far, we have discussed policy gradient methods in the context of on-policy learning, where the policy used to collect data (trajectories) is the same as the policy being optimized. However, there are also off-policy methods, where the data is collected using a different policy (behavior policy) than the one being optimized (target policy). The goal of off-policy methods is to leverage data collected from various policies to improve learning efficiency and stability. The main idea behind off-policy learning is to use importance sampling (Section 1.5) to correct for the distribution mismatch between the behavior policy and the target policy.

Let’s rewrite the policy gradient using importance sampling. What we have are state-action-reward pairs collected from the old policy \(\pi_{\theta_{old}}\), but we want to optimize the new policy \(\pi_{\theta}\):

\[ \begin{split} \nabla_{\theta} J(\pi_{\theta}) & = \mathbb{E}_{\tau \sim \pi_{\theta}} \left[ \nabla_{\theta} \log \pi_{\theta}(\tau) \sum_{t=0}^{T} r_t \right] \\ & = \mathbb{E}_{\tau \sim \pi_{\theta_{old}}} \left[ \underbrace{\frac{\pi_{\theta}(\tau)}{\pi_{\theta_{old}}(\tau)}}_{\text{Importance Ratio}} \nabla_{\theta} \log \pi_{\theta}(\tau) \sum_{t=0}^{T} r_t \right] \end{split} \tag{27}\]

Let’s expand the importance ratio: \[ \frac{\pi_{\theta}(\tau)}{\pi_{\theta_{old}}(\tau)} = \prod_{t=0}^{T} \frac{\pi_{\theta}(a_t | s_t)}{\pi_{\theta_{old}}(a_t | s_t)} \tag{28}\]

Plugging this back into Equation 27, we have: \[ \begin{split} \nabla_{\theta} J(\pi_{\theta}) &= \mathbb{E}_{\tau \sim \pi_{\theta_{old}}} \left[ \left( \prod_{t=0}^{T} \frac{\pi_{\theta}(a_t | s_t)}{\pi_{\theta_{old}}(a_t | s_t)} \right) \left( \nabla_{\theta} \log \pi_{\theta}(\tau) \sum_{t=0}^{T} r_t \right) \right] \\ & \approx \frac{1}{N} \sum_{i=1}^{N} \left[ \left( \prod_{t=0}^{T} \frac{\pi_{\theta}(a_t^{(i)} | s_t^{(i)})}{\pi_{\theta_{old}}(a_t^{(i)} | s_t^{(i)})} \right) \left( \sum_{t=0}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t^{(i)} | s_t^{(i)}) \right) \left( \sum_{t=0}^{T} r_t^{(i)} \right) \right] \end{split} \tag{29}\]
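In practice, the trajectory-level ratio in Equation 28 is computed in log space, since a product of hundreds of per-token ratios under- or overflows quickly. A sketch (all names mine):

```python
import math

def trajectory_ratio(logp_new, logp_old):
    """prod_t pi_new(a_t|s_t) / pi_old(a_t|s_t), from per-token log-probs.
    Summing log-prob differences before exponentiating avoids under/overflow."""
    return math.exp(sum(n - o for n, o in zip(logp_new, logp_old)))

# 300 tokens, each with per-token ratio 0.9: the trajectory ratio collapses
# toward zero, which is exactly the variance problem discussed below.
logp_old = [math.log(0.5)] * 300
logp_new = [math.log(0.45)] * 300
print(trajectory_ratio(logp_new, logp_old))  # ~ 0.9 ** 300, a tiny number
```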

There are several challenges with off-policy policy gradient methods: the importance ratio is a product of per-token ratios, so for long trajectories it can explode or vanish, leading to extremely high variance. This is one of the main motivations for the clipped surrogate objectives used by PPO and its successors.

\begin{algorithm} \caption{REINFORCE (Monte Carlo Policy Gradient)} \begin{algorithmic} \Require Policy $\pi_\theta(a \mid s)$ with parameters $\theta$ \Require Learning rate $\alpha$ \While{not converged} \State Sample a trajectory $\tau = (s_0, a_0, r_0, \dots, s_T, a_T, r_T)$ by running $\pi_\theta$ \For{$t = 0$ \textbf{to} $T$} \State $R_t \gets \sum_{t' = t}^{T} r_{t'}$ \Comment{Reward-to-go} \State $\theta \gets \theta + \alpha \, \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, R_t$ \EndFor \EndWhile \end{algorithmic} \end{algorithm}

Algorithm 1 summarizes the basic on-policy REINFORCE procedure discussed in this section.

4 Mapping LLM Post Training as RL Problems

In the previous section, we reviewed the fundamental concepts of reinforcement learning (RL) and policy gradient methods. Now, we will explore how these RL concepts can be applied to large language models (LLMs) in the context of post-training fine-tuning. We will discuss how to formulate the LLM fine-tuning process as an RL problem, including defining states, actions, rewards, and policies. This mapping will provide a foundation for understanding various RL algorithms used in LLMs, such as Reinforcement Learning from Human Feedback (RLHF) and Reinforcement Learning with Verifiable Rewards (RLVR).

4.1 State, Action, Policy in LLM

In the context of large language models (LLMs), we can map the components of reinforcement learning (RL) to the LLM fine-tuning process as follows:

  • State (\(s_t\)): The state at time step \(t\) can be defined as the current context or input to the LLM. This includes the text generated so far, any preceding prompts, and potentially other relevant information such as metadata or user preferences. The state encapsulates all the information that the model has access to when generating the next token.
  • Action (\(a_t\)): The action at time step \(t\) corresponds to the token or word generated by the LLM. The action space is typically the vocabulary of the language model, and the model selects an action based on its current policy (i.e., the probability distribution over the vocabulary given the current state).
  • Policy (\(\pi_{\theta}\)): The policy in the context of LLMs is represented by the language model itself, parameterized by \(\theta\). The policy defines the probability distribution over actions (tokens) given the current state (context). The policy can be expressed as \(\pi_{\theta}(a_t | s_t)\), which gives the probability of generating token \(a_t\) given the context \(s_t\).
  • Reward (\(r_t\)): The reward at time step \(t\) is a scalar value that indicates the quality of the generated token in the context of the overall text generation task. Rewards can be derived from various sources, such as human feedback, automated evaluation metrics, or other criteria that reflect the desired behavior of the LLM. The reward signal guides the learning process by providing feedback on the effectiveness of the generated tokens.

One nice property of the LLM setting is that the environment dynamics (transition probabilities) are known and deterministic. Given a state \(s_t\) and an action \(a_t\), the next state \(s_{t+1}\) is simply the concatenation of the current context with the newly generated token: \(s_{t + 1} = (s_t, a_t)\). This simplifies the RL formulation, as we do not need to learn or estimate the environment dynamics.

Because of this, we can focus on optimizing the policy (the LLM) directly based on the reward signals without worrying about modeling the environment transitions.

For example, at the beginning of generation, we have a prompt like “The capital of France is”; this is our initial state \(s_0\):

  • \(s_0\) = “The capital of France is”
  • \(a_0 \sim \text{softmax}(f_{\theta}(s_0))\) = ” Paris”
  • \(s_1 = (s_0, a_0)\) = “The capital of France is Paris”
  • \(a_1 \sim \text{softmax}(f_{\theta}(s_1))\) = “.”
  • \(s_2 = (s_1, a_1)\) = “The capital of France is Paris.”

This process continues until the model generates a complete response or reaches a predefined stopping criterion, e.g., when the <eos> token is generated or the maximum length is reached. In the end, we get a trajectory/episode/rollout \(\tau = (s_0, a_0, s_1, a_1, ..., s_T, a_T)\).
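The rollout above can be sketched as code. The toy `sample_next_token` below stands in for sampling from \(\text{softmax}(f_{\theta}(s_t))\) and is entirely hypothetical; only the deterministic concatenation dynamics mirror the text:

```python
import random

random.seed(0)

EOS = "<eos>"

def sample_next_token(state):
    """Stand-in for a_t ~ softmax(f_theta(s_t)): a toy token distribution."""
    vocab = [" Paris", ".", EOS]
    return random.choices(vocab, weights=[3, 2, 1])[0]

def rollout(prompt, max_len=10):
    """Deterministic dynamics: s_{t+1} = concat(s_t, a_t).
    Returns the trajectory as a list of (state, action) pairs."""
    state, trajectory = prompt, []
    for _ in range(max_len):
        action = sample_next_token(state)
        trajectory.append((state, action))
        state = state + action
        if action == EOS:  # stopping criterion
            break
    return trajectory

traj = rollout("The capital of France is")
for s, a in traj:
    print(repr(a))
```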

4.2 Reward Design Challenges in LLM

In reinforcement learning for large language models (LLMs), the reward function plays a crucial role in guiding the model’s learning process. The reward function defines the objectives and desired behaviors that the LLM should exhibit during text generation. However, designing effective reward functions for LLMs can be challenging:

  • Sparse Rewards: In many cases, rewards may only be provided at the end of a generated sequence, making it difficult to attribute credit to individual token (action) generations.
  • Ambiguous Objectives: The desired behavior of LLMs can be complex and multifaceted, making it hard to define a single reward function that captures all aspects of quality.
  • Human Feedback: When using human feedback as a reward signal, it can be noisy, inconsistent, and expensive to obtain.
  • Scalability: As LLMs generate long sequences, computing rewards for every token can be computationally expensive.
  • Exploration vs. Exploitation: Balancing the need for exploration (trying new token generations) and exploitation (generating high-reward tokens) is crucial for effective learning.

In the following sections, we will explore specific RL algorithms that have been applied to LLM fine-tuning, including Reinforcement Learning from Human Feedback (RLHF) and Reinforcement Learning with Verifiable Rewards (RLVR). We will discuss how these algorithms leverage the RL framework to improve the performance and alignment of LLMs with human preferences.

5 Reinforcement Learning from Human Feedback (RLHF)

After Pre-Training and Supervised Fine-Tuning, Large Language Models(LLMs) can generate high-quality text, but they may not always align with human preferences or ethical guidelines. To address this, Reinforcement Learning from Human Feedback (RLHF) has emerged as a powerful technique to fine-tune LLMs using feedback from human evaluators. RLHF leverages reinforcement learning (RL) principles to optimize the LLM’s behavior based on human-provided rewards, guiding the model to generate text that better aligns with human values and expectations. (Ouyang et al. 2022)

Figure 2: Overview of the RLHF process used in InstructGPT. The process involves three main steps: (1) collecting human feedback on model-generated outputs, (2) training a reward model to predict human preferences, and (3) fine-tuning the LLM using reinforcement learning to optimize for the learned reward model. (We only focus on the RL part in this blog)

Before diving into the RLHF algorithms, let’s first take a look at the data collection process for human feedback: the model generates responses to a set of prompts, and human annotators rank or rate them, producing the preference data used to train a reward model.

5.1 PPO

Following the foundational work of InstructGPT (Ouyang et al. 2022), Proximal Policy Optimization (PPO) (Schulman et al. 2017) has become one of the most widely used algorithms for implementing RLHF in large language models. The overall pipeline is shown in Figure 2 and has several steps:

  1. Collecting Human Feedback: Initially, the LLM generates responses to a set of prompts. Human evaluators then review these responses and provide feedback, typically in the form of rankings or ratings. This feedback serves as the basis for training a reward model.
  2. Training a Reward Model: A separate reward model is trained to predict the human feedback based on the generated responses. This model learns to assign higher scores to responses that align better with human preferences.
  3. Fine-Tuning with PPO: The LLM is then fine-tuned using the PPO algorithm, which optimizes the policy (the LLM) to maximize the expected reward as predicted by the reward model. PPO uses a clipped objective function to ensure that policy updates are not too large, maintaining stability during training.

Here we just focus on the step 3, the PPO fine-tuning process. The PPO algorithm can be summarized as follows:

\begin{algorithm}
\caption{Proximal Policy Optimization (PPO) for RLHF (LLMs)}
\begin{algorithmic}
\Require Initial policy $\pi_{\theta_{old}}$ with parameters $\theta_{old}$
\Require Reward model $R_{\phi}$ trained on human feedback
\Require Reference policy $\pi_{\text{ref}}$ (e.g., SFT model)
\Require Value function (critic) $V_{\psi}$ with parameters $\psi$
\Require Clipping parameter $\epsilon$, KL coefficient $\beta$
\Require Discount $\gamma$, GAE parameter $\lambda$
\Require Number of PPO epochs $K$, minibatch size $m$
\Require Learning rates $\alpha_{\theta}, \alpha_{\psi}$
\While{not converged}
  \State Sample a batch of prompts $\{x_i\}_{i=1}^{B}$
  \State Roll out $\pi_{\theta_{old}}$ to get responses $\{y_i\}$ with token actions $\{a_{i,t}\}_{t=1}^{T_i}$
  \State Compute old log-probs $\ell^{old}_{i,t} \gets \log \pi_{\theta_{old}}(a_{i,t}\mid x_i, a_{i,<t})$
  \State Compute ref log-probs $\ell^{ref}_{i,t} \gets \log \pi_{\text{ref}}(a_{i,t}\mid x_i, a_{i,<t})$
  \State Compute sequence reward $r^{RM}_i \gets R_{\phi}(x_i, y_i)$
  \Comment{Define per-timestep rewards with KL penalty (token-level) + terminal reward}
  \For{$i = 1$ \textbf{to} $B$}
    \For{$t = 1$ \textbf{to} $T_i$}
      \State $\text{kl}_{i,t} \gets \ell^{old}_{i,t} - \ell^{ref}_{i,t}$
      \State $r_{i,t} \gets -\beta \cdot \text{kl}_{i,t}$
    \EndFor
    \State $r_{i,T_i} \gets r_{i,T_i} + r^{RM}_i$ \Comment{terminal reward added at last token}
  \EndFor
  \For{$i = 1$ \textbf{to} $B$} \Comment{Compute advantages via GAE using critic $V_{\psi}$}
    \State $A_{i,T_i+1} \gets 0$
    \For{$t = T_i$ \textbf{down to} $1$}
      \State $\delta_{i,t} \gets r_{i,t} + \gamma V_{\psi}(s_{i,t+1}) - V_{\psi}(s_{i,t})$
      \State $A_{i,t} \gets \delta_{i,t} + \gamma \lambda A_{i,t+1}$
      \State $\hat{R}_{i,t} \gets A_{i,t} + V_{\psi}(s_{i,t})$ \Comment{return/target}
    \EndFor
  \EndFor
  \For{$epoch = 1$ \textbf{to} $K$} \Comment{Optimize policy and value for $K$ epochs over minibatches}
    \State Shuffle all token positions $(i,t)$ and split into minibatches of size $m$
    \For{\textbf{each} minibatch $\mathcal{M}$}
      \State $\ell_{i,t} \gets \log \pi_{\theta}(a_{i,t}\mid s_{i,t}) \quad \forall (i,t)\in\mathcal{M}$
      \State $\rho_{i,t} \gets \exp(\ell_{i,t} - \ell^{old}_{i,t})$
      \State $\tilde{\rho}_{i,t} \gets \text{clip}(\rho_{i,t}, 1-\epsilon, 1+\epsilon)$
      \State $L^{\text{clip}}(\theta) \gets -\mathbb{E}_{(i,t)\in\mathcal{M}}\left[\min(\rho_{i,t}A_{i,t}, \tilde{\rho}_{i,t}A_{i,t})\right]$
      \State $L^{V}(\psi) \gets \mathbb{E}_{(i,t)\in\mathcal{M}}\left[(V_{\psi}(s_{i,t}) - \hat{R}_{i,t})^2\right]$
      \State $\theta \gets \theta - \alpha_{\theta}\nabla_{\theta} L^{\text{clip}}(\theta)$
      \State $\psi \gets \psi - \alpha_{\psi}\nabla_{\psi} L^{V}(\psi)$
    \EndFor
  \EndFor
  \State $\theta_{old} \gets \theta$ \Comment{sync old policy for next rollout}
\EndWhile
\end{algorithmic}
\end{algorithm}
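The three core computations in the algorithm above (KL-shaped token rewards, GAE, and the clipped surrogate loss) can be sketched in a few lines of NumPy. This is a minimal single-sequence illustration, not a full trainer: the function and argument names are assumptions, and a real implementation would operate on batched tensors with autograd.

```python
import numpy as np

def shaped_rewards(old_logp, ref_logp, seq_reward, beta):
    """Per-token reward: -beta * (log pi_old - log pi_ref), with the
    reward-model score added at the final token."""
    r = -beta * (old_logp - ref_logp)
    r[-1] += seq_reward
    return r

def gae(rewards, values, gamma, lam):
    """Generalized Advantage Estimation over one sequence.
    `values` has length T+1 (a bootstrap value is appended at the end)."""
    T = len(rewards)
    adv = np.zeros(T)
    last = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        last = delta + gamma * lam * last
        adv[t] = last
    returns = adv + values[:T]  # critic regression targets
    return adv, returns

def ppo_clip_loss(logp, old_logp, adv, eps):
    """Negative clipped surrogate objective, averaged over tokens."""
    ratio = np.exp(logp - old_logp)
    clipped = np.clip(ratio, 1 - eps, 1 + eps)
    return -np.mean(np.minimum(ratio * adv, clipped * adv))
```

Note that at the first minibatch of each rollout `logp == old_logp`, so the ratio is exactly 1 and the loss reduces to the negative mean advantage; clipping only becomes active after the policy has taken a few gradient steps.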

5.2 DPO

Direct Preference Optimization (DPO) (Rafailov et al. 2024) is a novel approach for fine-tuning large language models using human feedback without the need for a separate reward model.
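Concretely, given a preference dataset of prompts \(x\) with a chosen response \(y_w\) and a rejected response \(y_l\), DPO minimizes a simple classification-style loss (Rafailov et al. 2024):

\[ \mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\,\mathbb{E}_{(x, y_w, y_l)\sim\mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right) \right] \]

where \(\sigma\) is the sigmoid and \(\beta\) controls how far the policy may drift from the reference model. The policy's own log-ratios act as an implicit reward, which is why no separate reward model is needed.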

Figure 3: Overview of the Direct Preference Optimization (DPO) process. DPO directly optimizes the policy using human preference data by adjusting the model’s logits based on the preference between pairs of responses, eliminating the need for a separate reward model.
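The loss above is easy to compute once the sequence log-probabilities are in hand. A minimal NumPy sketch for a single preference pair (the function and argument names are assumptions; each argument is a log-probability summed over the response tokens):

```python
import numpy as np

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair: -log sigmoid of the
    beta-scaled difference of policy-vs-reference log-ratios."""
    logits = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    return -np.log(1.0 / (1.0 + np.exp(-logits)))  # -log sigmoid(logits)
```

A useful sanity check: when the policy equals the reference model, the logits are zero and the loss is exactly \(\log 2\); any update that raises the chosen response's log-ratio relative to the rejected one drives the loss below that value.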

5.3 REINFORCE Leave One-Out (RLOO)

REINFORCE Leave-One-Out (RLOO) removes the learned critic entirely. For each prompt it samples \(k\) responses and uses the mean reward of the other \(k-1\) responses as the baseline for each one:

\[ \hat{A}_i = R(x, y_i) - \frac{1}{k-1}\sum_{j \neq i} R(x, y_j) \]

Because the baseline for sample \(i\) does not depend on \(y_i\), the REINFORCE gradient stays unbiased while its variance drops substantially, at the cost of \(k\) rollouts per prompt.
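The leave-one-out baseline is a one-liner in NumPy; this small sketch (function name is an assumption) computes the per-sample advantages for one prompt's group of rollouts:

```python
import numpy as np

def rloo_advantages(rewards):
    """Leave-one-out advantages: each sample's reward minus the mean
    reward of the other k-1 samples drawn for the same prompt."""
    rewards = np.asarray(rewards, dtype=float)
    k = len(rewards)
    baseline = (rewards.sum() - rewards) / (k - 1)  # mean of the others
    return rewards - baseline
```

By construction the advantages sum to zero within a group, so responses are only pushed up or down relative to their siblings, never in absolute terms.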

5.4 ReMax

5.5 REINFORCE++

6 Reinforcement Learning with Verifiable Rewards (RLVR)

6.1 GRPO

6.1.1 Objective Function

The objective function of GRPO is defined as:

\[ \begin{split} \mathcal{J}_{\text{GRPO}}(\pi_\theta) &= \mathbb{E}_{q \sim \mathcal{D}} \;\mathbb{E}_{o \sim \pi_{\theta_{\text{old}}}(\cdot\mid q)} \left[ \sum_{t=1}^{|o|} \min \Big( \rho_t \hat{A}_{t}, \text{clip}\big( \rho_t, 1 - \epsilon_{\text{clip}}, 1 + \epsilon_{\text{clip}} \big)\hat{A}_{t} \Big) \;-\;\beta \,\text{KL}\big(\pi_\theta \,\|\, \pi_{\text{ref}}\big) \right] \\ &\approx \frac{1}{B}\sum_{b=1}^{B} \left[ \frac{1}{G}\sum_{i=1}^{G} \frac{1}{|o_{b,i}|}\sum_{t=1}^{|o_{b,i}|} \min \Big( \rho_{b,i,t} \hat{A}_{b,i,t}, \text{clip}\big( \rho_{b,i,t}, 1 - \epsilon_{\text{clip}}, 1 + \epsilon_{\text{clip}} \big)\hat{A}_{b,i,t} \Big) \;-\;\beta \,\text{KL}\big(\pi_\theta \,\|\, \pi_{\text{ref}}\big) \right] \end{split} \tag{30}\]

where the group-relative advantage assigns every token of response \(o_i\) the group-normalized reward of that response:

\[ \hat{A}_{i,t} = \frac{r_i - \text{mean}\big(\{r_1, \ldots, r_G\}\big)}{\text{std}\big(\{r_1, \ldots, r_G\}\big)} \tag{31}\]

and the token-level importance ratio is:

\[ \rho_{i,t} = \frac{\pi_\theta(o_{i,t}\mid q, o_{i,<t})}{\pi_{\theta_{\text{old}}}(o_{i,t}\mid q, o_{i,<t})} \tag{32}\]

Note that the clipping ratio is taken against the old (rollout) policy \(\pi_{\theta_{\text{old}}}\); the reference policy \(\pi_{\text{ref}}\) only enters through the KL penalty term.
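The key departure from PPO is that the critic is replaced by the group statistic in Equation 31. A minimal NumPy sketch of that step (function name is an assumption; the scalar advantage would then be broadcast to every token of the corresponding response):

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: normalize each response's scalar reward
    by the mean and std of the G rewards sampled for the same prompt."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)
```

With a binary verifiable reward (e.g. 1 if the answer checks out, 0 otherwise), correct responses in a mixed group get a positive advantage and incorrect ones a negative advantage; if every response in the group scores the same, all advantages collapse to zero and the group contributes no gradient.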

6.2 Dr.GRPO

6.3 DAPO

6.4 GSPO

6.5 CISPO

6.6 SAPO

6.7 GDPO

7 LLM-RL in Practice

7.1 TRL: Transformer Reinforcement Learning

Link

7.2 verl: Volcano Engine Reinforcement Learning for LLMs

Link

7.3 OpenRLHF

Link

7.4 SWIFT (Scalable lightWeight Infrastructure for Fine-Tuning)

Link

8 Challenges & Future Directions


9 Summary

Ouyang, Long, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, et al. 2022. “Training Language Models to Follow Instructions with Human Feedback.” March 4, 2022. https://doi.org/10.48550/arXiv.2203.02155.
Rafailov, Rafael, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. 2024. “Direct Preference Optimization: Your Language Model Is Secretly a Reward Model.” July 29, 2024. https://doi.org/10.48550/arXiv.2305.18290.
Schulman, John, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. “Proximal Policy Optimization Algorithms.” August 28, 2017. https://doi.org/10.48550/arXiv.1707.06347.