This repository serves as a collection of research notes and resources on training large language models (LLMs) and Reinforcement Learning from Human Feedback (RLHF). It focuses on the latest research, methodologies, and techniques for fine-tuning language models.
A curated list of materials providing an introduction to RL and RLHF:
- Research papers and books covering key concepts in reinforcement learning.
- Video lectures explaining the fundamentals of RLHF.
An extensive collection of state-of-the-art approaches for preference optimization and model alignment:
- Key techniques such as PPO, DPO, KTO, and ORPO.
- The latest arXiv publications and publicly available implementations.
- Analysis of effectiveness across different optimization strategies.
This repository is designed as a reference for researchers and engineers working on reinforcement learning and large language models. If you're interested in model alignment, experiments with DPO and its variants, or alternative RL-based methods, you will find valuable resources here.
- Reinforcement Learning: An Overview
- A Comprehensive Survey of LLM Alignment Techniques: RLHF, RLAIF, PPO, DPO and More
- Book-Mathematical-Foundation-of-Reinforcement-Learning
- The FASTEST introduction to Reinforcement Learning on the internet
- rlhf-book
- PPO - Proximal Policy Optimization Algorithm - OpenAI
- DPO - Direct Preference Optimization: Your Language Model is Secretly a Reward Model - Stanford (a minimal loss sketch appears after this list)
- online DPO
- KTO - KTO: Model Alignment as Prospect Theoretic Optimization
- SimPO - Simple Preference Optimization with a Reference-Free Reward - Princeton
- ORPO - Monolithic Preference Optimization without Reference Model - Kaist AI
- Sample Efficient Reinforcement Learning with REINFORCE
- REINFORCE++
- RPO - Reward-aware Preference Optimization: A Unified Mathematical Framework for Model Alignment
- RLOO - Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs
- GRPO
- ReMax - Simple, Effective, and Efficient Reinforcement Learning Method for Aligning Large Language Models
- DPOP - Smaug: Fixing Failure Modes of Preference Optimisation with DPO-Positive
- BCO - Binary Classifier Optimization for Large Language Model Alignment
| Method |
| --- |
| DPO |
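To ground the DPO entry above, here is a minimal sketch of the DPO loss in PyTorch. It assumes per-sequence log-probabilities for the chosen and rejected responses have already been computed for both the policy and a frozen reference model; the function and variable names are illustrative, not taken from any particular codebase.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Log-ratios of the policy vs. the frozen reference for each response.
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # Implicit reward margin; beta controls how far the policy may drift from the reference.
    logits = beta * (chosen_logratio - rejected_logratio)
    # Bradley-Terry negative log-likelihood of preferring the chosen response.
    return -F.logsigmoid(logits).mean()
```

The Bradley-Terry formulation used here is the same one covered in the DPO explainer video listed further down.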
Notes for learning RL: Value Iteration -> Q Learning -> DQN -> REINFORCE -> Policy Gradient Theorem -> TRPO -> PPO
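As a small illustration of the last step in that path, here is a sketch of PPO's clipped surrogate objective (policy term only). Names and shapes are illustrative assumptions; a full RLHF loop also needs a value loss, an entropy bonus, and a KL penalty against the SFT/reference model.

```python
import torch

def ppo_clip_loss(new_logprobs, old_logprobs, advantages, clip_eps=0.2):
    # Ratio between the current policy and the policy that generated the rollouts.
    ratio = torch.exp(new_logprobs - old_logprobs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Take the pessimistic bound and negate, since optimizers minimize.
    return -torch.min(unclipped, clipped).mean()
```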
- CS234: Reinforcement Learning Winter 2025
- CS285 Deep Reinforcement Learning
- Welcome to Spinning Up in Deep RL
- deep-rl-course from Huggingface
- RL Course by David Silver
- Reinforcement Learning from Human Feedback explained with math derivations and the PyTorch code.
- Direct Preference Optimization (DPO) explained: Bradley-Terry model, log probabilities, math
- GRPO vs PPO (see the group-relative advantage sketch after this list)
- Reasoning LLMs
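A rough sketch of the key difference: GRPO drops PPO's learned value network and instead normalizes rewards within a group of responses sampled for the same prompt. The tensor layout below is an assumption made for illustration.

```python
import torch

def grpo_advantages(rewards, eps=1e-6):
    # rewards: [num_prompts, group_size] scalar rewards for the sampled responses.
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    # Each response's advantage is its reward standardized within its own group,
    # and is shared by every token of that response.
    return (rewards - mean) / (std + eps)
```

These advantages then feed a PPO-style clipped objective with a KL term toward the reference model.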
- Process Reinforcement through Implicit Rewards
- DeepScaleR: Surpassing O1-Preview with a 1.5B Model by Scaling RL
- On the Emergence of Thinking in LLMs I: Searching for the Right Intuition
- LIMR: Less is More for RL Scaling
- LIMO: Less Is More for Reasoning
- s1: Simple test-time scaling and s1.1
- The 37 Implementation Details of Proximal Policy Optimization
- Online-DPO-R1: Unlocking Effective Reasoning Without the PPO Overhead (blog and GitHub)
- a reinforcement learning guide
- Approximating KL Divergence (a short estimator sketch appears after this list)
- How to align open LLMs in 2025 with DPO & synthetic data
- DeepSeek-R1 -> The Illustrated DeepSeek-R1; DeepSeek R1's recipe to replicate o1 and the future of reasoning LMs; DeepSeek R1 and R1-Zero Explained
- SelfCite: Self-Supervised Alignment for Context Attribution in Large Language Models
- ReasonFlux: Hierarchical LLM Reasoning via Scaling Thought Templates
- A Minimalist Approach to Offline Reinforcement Learning
- Training Language Models to Reason Efficiently
- Satori: Reinforcement Learning with Chain-of-Action-Thought Enhances LLM Reasoning via Autoregressive Search
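Following up on the "Approximating KL Divergence" post listed above, here is a small sketch of the k1/k2/k3 Monte-Carlo estimators it discusses, written for per-sample log-probabilities; the function name is a placeholder of mine.

```python
import torch

def kl_estimators(logp, logq):
    # Estimate KL(q || p) from samples x ~ q, using the log ratio log p(x) - log q(x).
    logr = logp - logq
    k1 = -logr                        # unbiased, but high variance
    k2 = 0.5 * logr ** 2              # biased, lower variance
    k3 = (logr.exp() - 1.0) - logr    # unbiased and low variance; often used in RLHF code
    return k1.mean(), k2.mean(), k3.mean()
```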
- [R1 - distill] https://huggingface.co/datasets/open-r1/OpenR1-Math-220k
- [R1 - distill] https://huggingface.co/datasets/simplescaling/s1K-1.1
- [R1 - distill] https://huggingface.co/datasets/open-thoughts/OpenThoughts-114k
- [R1 - distill] https://huggingface.co/datasets/GAIR/LIMO
- [R1 - distill] https://huggingface.co/datasets/AI-MO/NuminaMath-CoT