LLM Part 3: Alignment
In Part 1 and Part 2 of the LLM series, we covered the architecture and inference techniques for LLMs. In this Part 3, we focus on alignment techniques, which are crucial for ensuring that LLMs behave in ways consistent with human values and intentions. We will first look at the simple Supervised Fine-Tuning (SFT) approach, which fine-tunes LLMs on curated datasets that reflect human values and preferences. We will then turn to reinforcement learning from human feedback (RLHF) techniques, which train LLMs using feedback from human evaluators to improve their alignment with human intentions, covering algorithms from PPO and DPO to GRPO and its more recent variants GSPO and SAPO, and discussing their implications for the development and deployment of these models.
1 Supervised Fine-Tuning (SFT)
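Concretely, SFT is just next-token cross-entropy training on curated (prompt, response) pairs, usually with the loss computed only on the response tokens. Below is a minimal sketch in PyTorch; it assumes `model(input_ids)` returns token logits of shape `(batch, seq_len, vocab)`, and it uses the usual `-100` ignore-index trick to mask prompt positions:

```python
import torch
import torch.nn.functional as F

def sft_loss(model, input_ids, prompt_lengths):
    """Next-token cross-entropy on response tokens only.

    input_ids:      (batch, seq_len) prompt + response token ids
    prompt_lengths: (batch,) number of prompt tokens per example
    """
    logits = model(input_ids)            # assumed shape: (batch, seq_len, vocab)
    # Shift so position t predicts token t+1.
    logits = logits[:, :-1, :]
    targets = input_ids[:, 1:].clone()

    # Mask out prompt positions: the loss is computed only where the
    # target token belongs to the curated response.
    positions = torch.arange(targets.size(1), device=targets.device)
    prompt_mask = positions.unsqueeze(0) < (prompt_lengths.unsqueeze(1) - 1)
    targets[prompt_mask] = -100          # ignored by cross_entropy below

    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        ignore_index=-100,
    )
```

In practice this loss is simply plugged into a standard training loop over an instruction-following dataset.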
2 Review of Reinforcement Learning
In this section, we will first review the basics of reinforcement learning, including key concepts such as rewards, policies, loss functions, and actor-critic methods. This will provide a solid foundation for understanding the RLHF techniques we will explore later.
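As a concrete anchor for these concepts, here is a minimal sketch of the vanilla policy-gradient (REINFORCE) loss, which the actor-critic and PPO-style methods later in this post refine. The inputs are assumed to be the per-step log-probabilities and rewards collected while acting in an environment for one episode:

```python
import torch

def reinforce_loss(log_probs, rewards, gamma=0.99):
    """Vanilla policy-gradient (REINFORCE) loss for one episode.

    log_probs: list of log pi(a_t | s_t) tensors collected while acting
    rewards:   list of scalar rewards r_t received from the environment
    """
    # Discounted return G_t = r_t + gamma * r_{t+1} + ...
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.insert(0, g)
    returns = torch.tensor(returns, device=log_probs[0].device)
    # Normalizing returns reduces gradient variance (a common trick).
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)

    # Maximize expected return == minimize negative sum of log-prob * return.
    return -(torch.stack(log_probs) * returns).sum()
```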
3 Format LLM Alignment as a Reinforcement Learning Problem
As reviewed in Section 2, in reinforcement learning an agent interacts with an environment, taking actions according to its policy in order to maximize cumulative reward. In the context of LLM alignment, we can map these components as follows:
- Agent: The LLM being aligned.
- Environment: The context in which the LLM operates, including user inputs and external knowledge sources.
- Actions: The responses or outputs generated by the LLM.
- Rewards: Feedback signals that indicate how well the LLM’s outputs align with human values and intentions.
- Policy: The strategy used by the LLM to generate outputs based on the given inputs.
One advantage of this formulation is that it allows us to leverage existing reinforcement learning algorithms and techniques to improve the alignment of LLMs. Moreover, for LLMs the next state is deterministic given the current state and action: it is simply the current token sequence with the newly generated token appended, which simplifies the problem to some extent. With this formulation in place, we can now explore the various RLHF techniques that have been proposed to align LLMs with human values and intentions.
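To make the mapping concrete, the sketch below treats one full response as an episode: the prompt is the initial state, each generated token is an action sampled from the policy (the LLM itself), the transition deterministically appends that token to the state, and a reward model scores the finished response. `policy_model`, `reward_model`, and `tokenizer` are illustrative placeholders rather than a specific library API:

```python
import torch

def rollout(policy_model, reward_model, tokenizer, prompt, max_new_tokens=128):
    """One RLHF-style episode: prompt -> response -> scalar reward.

    Assumed placeholders: policy_model(ids) returns next-token logits of
    shape (batch, seq_len, vocab); reward_model(ids) returns a scalar
    alignment score for the full sequence.
    """
    state = tokenizer.encode(prompt, return_tensors="pt")   # initial state = prompt
    log_probs = []

    for _ in range(max_new_tokens):
        logits = policy_model(state)[:, -1, :]               # policy over next action
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample()                                # action = next token
        log_probs.append(dist.log_prob(action))
        # Deterministic transition: next state = current state + chosen token.
        state = torch.cat([state, action.unsqueeze(-1)], dim=-1)
        if action.item() == tokenizer.eos_token_id:
            break

    reward = reward_model(state)   # feedback signal for the whole response
    return state, torch.stack(log_probs), reward
```

The RLHF algorithms below differ mainly in how they turn these rollouts and rewards into a policy update.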
4 RLHF Algorithms
4.1 Proximal Policy Optimization (PPO)
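At the core of PPO is the clipped surrogate objective, which limits how far the updated policy can move from the policy that generated the rollouts. A minimal sketch of the per-token loss follows; in RLHF pipelines a KL penalty against a frozen reference model is typically added to the reward or the loss, and the tensor names here are illustrative:

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """PPO clipped surrogate loss over a batch of tokens.

    logp_new:   log-probs of the taken tokens under the current policy
    logp_old:   log-probs under the policy that generated the rollout
    advantages: per-token advantage estimates (e.g. from GAE with a critic)
    """
    ratio = torch.exp(logp_new - logp_old)            # importance ratio
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    # Take the pessimistic (minimum) objective, then negate to minimize.
    return -torch.min(unclipped, clipped).mean()
```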
4.2 Direct Preference Optimization (DPO)
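DPO skips the explicit reward model and the RL loop entirely: it optimizes the policy directly on preference pairs, using the log-probability ratio against a frozen reference model as an implicit reward. A minimal sketch of the loss, assuming the summed log-probabilities of the chosen and rejected responses have already been computed:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss on a batch of preference pairs.

    Each argument is the summed log-probability of a full response
    (chosen or rejected) under the policy or the frozen reference model.
    """
    chosen_rewards = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_rewards = beta * (policy_rejected_logp - ref_rejected_logp)
    # Maximize the margin between chosen and rejected implicit rewards.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```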
4.3 Group Relative Policy Optimization (GRPO)
https://arxiv.org/abs/2402.03300
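The key idea in GRPO (from the paper linked above) is to drop the learned value function: for each prompt, a group of responses is sampled and scored by the reward model, and each response's advantage is its reward standardized within the group. A minimal sketch of that advantage computation, which then plugs into a PPO-style clipped loss:

```python
import torch

def grpo_advantages(group_rewards, eps=1e-8):
    """Group-relative advantages for one prompt.

    group_rewards: (G,) scalar rewards for G sampled responses to the same prompt.
    Returns one advantage per response; that value is shared by every
    token of the corresponding response.
    """
    mean = group_rewards.mean()
    std = group_rewards.std()
    return (group_rewards - mean) / (std + eps)
```

Each response's advantage is broadcast to all of its tokens and used in place of the critic-based advantage in the PPO sketch above.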
4.4 Group Sequence Policy Optimization (GSPO)
https://arxiv.org/abs/2507.18071
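GSPO's main change relative to GRPO is to compute the importance ratio and the clipping at the sequence level rather than per token, using a length-normalized ratio of response probabilities. A rough sketch under that reading of the paper; the clip range shown is illustrative only and is much tighter than PPO's because the normalized ratio stays close to 1:

```python
import torch

def gspo_loss(logp_new_sum, logp_old_sum, lengths, advantages, clip_eps=3e-4):
    """Sequence-level clipped objective (GSPO-style sketch).

    logp_new_sum / logp_old_sum: summed token log-probs of each response
    lengths:                     number of tokens in each response
    advantages:                  group-relative advantage per response
    """
    # Length-normalized sequence importance ratio:
    # (pi_new(y|x) / pi_old(y|x)) ** (1 / |y|)
    seq_ratio = torch.exp((logp_new_sum - logp_old_sum) / lengths)
    unclipped = seq_ratio * advantages
    clipped = torch.clamp(seq_ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    # clip_eps is a tunable hyperparameter; treat the default as a placeholder.
    return -torch.min(unclipped, clipped).mean()
```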
4.5 Soft Adaptive Policy Optimization (SAPO)
Used in the Qwen3-VL (Bai et al. 2025) model.