RLHF for LLMs: A Deep Dive into Reinforcement Learning from Human Feedback

Matthew-Mcmullen · Published in The AI Technology · 5 min read · Jan 23, 2024


RLHF, or reinforcement learning from human feedback, is a machine learning technique that optimizes an AI model's performance by combining reinforcement learning with human input. RLHF is highly instrumental in training and fine-tuning generative AI (GenAI) models, including large language models (LLMs), for a host of tasks such as realistic content generation, question answering, and code generation. LLMs trained with RLHF generate outputs that are informative and aligned with human values.

RLHF is well suited to tasks with complex, ill-defined, and difficult-to-specify goals. It has been widely leveraged to boost the accuracy, relevance, and ethical alignment of GenAI models.

Why RLHF for LLMs

Large language models are trained to generate outputs that reflect the patterns available in training data. While recent breakthroughs in technology have broadened the application of LLMs as chatbot engines, or even as reasoning engines in general AI applications, the core functionality of these models remains predicting the next word in response to prompts based on patterns learned from their training data.

Large language models are initially trained with conventional methods to generate coherent, grammatically correct text. They have no inherent ability to understand user intent or to generate contextually relevant output without specific instructions. Despite their grammatical proficiency, training LLMs to produce contextually relevant content is an uphill task.

Companies building AI models adopted the RLHF approach to improve the performance of their language models. This is why OpenAI's InstructGPT models, trained with human feedback, outperform its earlier GPT models, specifically in following user intent, producing accurate results, and reducing model hallucination.

Reinforcement learning from human feedback can bring developers more benefit than larger training datasets. According to OpenAI's research, its evaluators preferred outputs from the 1.3B-parameter InstructGPT model over outputs from the 175B-parameter GPT-3 model.

How RLHF Works

The process of training GenAI models, particularly large models, with RLHF typically involves the following phases.

Starting with a Pre-trained Model

RLHF is used to fine-tune and improve the performance of a large language model that has already been pre-trained on a large dataset, rather than training a model from scratch. For example, the InstructGPT models leveraged RLHF to build on the pre-existing Generative Pre-trained Transformer (GPT) models and enhance their performance.
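
As a rough illustration, the snippet below loads a publicly available pre-trained causal language model with the Hugging Face transformers library as the starting checkpoint; the model name "gpt2" is only an example, not the model InstructGPT was built on.

```python
# A minimal sketch: RLHF starts from an existing pre-trained checkpoint
# rather than from random weights. "gpt2" is just an illustrative model name.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # any pre-trained causal LM checkpoint would do
tokenizer = AutoTokenizer.from_pretrained(model_name)
base_model = AutoModelForCausalLM.from_pretrained(model_name)

# This pre-trained model is what the later RLHF stages (supervised
# fine-tuning, reward modeling, policy optimization) go on to refine.
```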

Supervised Fine-Tuning

Supervised fine-tuning is employed before explicit reinforcement learning is introduced, to teach the model to produce the kind of results users expect. If the base model's outputs are not aligned with the user's goal, supervised fine-tuning uses labeled demonstrations to train the model to respond appropriately to different kinds of prompts.

Human experts in the loop create labeled examples in a structured (prompt, response) format to train the model to identify patterns and respond to prompts in that particular domain. However, creating these labeled examples manually is expensive and time-consuming.
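
As a minimal sketch of this step, the code below fine-tunes a pre-trained model on a couple of hypothetical (prompt, response) pairs using the standard next-token prediction loss; the example pairs, model name, and learning rate are assumptions, not details from the article.

```python
# A minimal supervised fine-tuning sketch: train the pre-trained LM on
# labeled (prompt, response) pairs with the ordinary next-token loss.
# The two example pairs below are purely illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

sft_pairs = [
    ("Summarize: The cat sat on the mat.", "A cat was sitting on a mat."),
    ("Translate to French: Good morning.", "Bonjour."),
]

model.train()
for prompt, response in sft_pairs:
    # Concatenate prompt and target response into one training sequence.
    batch = tokenizer(prompt + " " + response, return_tensors="pt")
    # Using the input ids as labels makes the model predict each next token.
    outputs = model(**batch, labels=batch["input_ids"])
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```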

Creating a Reward Model

A reward model is needed to translate human feedback into a format that the reinforcement learning algorithm can understand. The complexity of human preferences and values makes it challenging to convert them into numerical reward signals, and there is no straightforward mathematical formula for doing so. Hence, designing an effective reward model that translates subjective human preferences into a numerical reward signal suitable for reinforcement learning is a vital step in RLHF.

The reward model is trained on a large amount of direct feedback from human labelers, typically rankings or comparisons of different responses generated by the model, so that it learns to imitate human preferences when allocating rewards. This enables training to continue without a human in the loop for every output.
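
A minimal sketch of reward-model training on pairwise comparisons is shown below. In practice the reward model is typically the language model itself with a scalar output head; here a small feed-forward network over hypothetical fixed-size response features stands in for it, and the tensors are random placeholders rather than real labeler data.

```python
# Sketch of reward-model training from pairwise human preferences.
# chosen_feats / rejected_feats are placeholder features for the response a
# labeler preferred and the one they rejected, for a batch of 8 comparisons.
import torch
import torch.nn as nn
import torch.nn.functional as F

reward_model = nn.Sequential(nn.Linear(768, 256), nn.ReLU(), nn.Linear(256, 1))
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-4)

chosen_feats = torch.randn(8, 768)    # stand-in for preferred responses
rejected_feats = torch.randn(8, 768)  # stand-in for rejected responses

r_chosen = reward_model(chosen_feats)      # scalar reward per preferred response
r_rejected = reward_model(rejected_feats)  # scalar reward per rejected response

# Pairwise (Bradley-Terry style) loss: push the preferred response's reward
# above the rejected one's.
loss = -F.logsigmoid(r_chosen - r_rejected).mean()
loss.backward()
optimizer.step()
```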

RL Policy Optimization

Once the reward model is ready, a feedback loop is set up to train and fine-tune the reinforcement learning (RL) policy. In this case, the RL policy is a copy of the original large language model. The policy generates responses, the reward model scores them, and the policy is updated to produce higher-scoring responses, creating a feedback loop between the RL policy and the reward model.

Based on the reward scores assigned by the reward model, the RL policy learns to create content that is contextually relevant and in line with the feedback encoded in the human-trained reward model. The iteration continues until the RL agent achieves the desired quality of output.
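
The sketch below illustrates one turn of this loop under several assumptions: the trained reward model is replaced by a hypothetical reward_for() placeholder, and the update is a simple REINFORCE-style step with a KL penalty against a frozen copy of the original model. Production systems typically use PPO with clipping, value baselines, and batching instead.

```python
# One conceptual turn of the RLHF loop: sample a response from the current
# policy, score it with a (placeholder) reward model, penalize drift from the
# frozen reference model via a KL term, and nudge the policy toward
# higher-reward responses. kl_coef and the learning rate are assumed values.
import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
policy = AutoModelForCausalLM.from_pretrained("gpt2")  # model being tuned
reference = copy.deepcopy(policy).eval()               # frozen pre-trained copy
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-6)
kl_coef = 0.1

def reward_for(text: str) -> torch.Tensor:
    """Hypothetical stand-in for the trained reward model's scalar score."""
    return torch.tensor(len(text.split()) / 20.0)  # placeholder heuristic

prompt = "Explain RLHF in one sentence:"
inputs = tokenizer(prompt, return_tensors="pt")
prompt_len = inputs["input_ids"].shape[1]

# Sample a response from the current policy.
generated = policy.generate(**inputs, do_sample=True, max_new_tokens=30)
response_text = tokenizer.decode(generated[0, prompt_len:], skip_special_tokens=True)

# Per-token log-probabilities of the sampled response under policy and reference.
targets = generated[:, 1:].unsqueeze(-1)
policy_logp = torch.log_softmax(policy(generated).logits[:, :-1], dim=-1)
with torch.no_grad():
    ref_logp = torch.log_softmax(reference(generated).logits[:, :-1], dim=-1)
resp_logp = policy_logp.gather(2, targets).squeeze(-1)[:, prompt_len - 1:]
ref_resp_logp = ref_logp.gather(2, targets).squeeze(-1)[:, prompt_len - 1:]

# KL-penalized reward, then a simple REINFORCE-style policy update.
reward = reward_for(response_text) - kl_coef * (resp_logp - ref_resp_logp).sum()
loss = -reward.detach() * resp_logp.sum()  # higher reward reinforces this sample
loss.backward()
optimizer.step()
```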

Limitations of RLHF

Despite its impressive record in training AI models to execute complex tasks, from robotics to NLP, the application of RLHF has some limitations:

Expensive and resource-intensive: RLHF requires firsthand input from human evaluators. Training generative AI models, especially large models, through RLHF can be very expensive and resource-intensive.

Complexity of human preferences: In RLHF, model evaluation depends on human preferences. However, human labelers may fail to reach firm consensus, not only on disputed or factual information but also on what counts as 'appropriate' model behavior. The complexity of human preference thus makes it hard to establish a genuine 'ground truth' against which to judge the model's performance.

Human fallibility and adversarial intent: Human labelers are not infallible; they can make errors of judgment or hold genuinely contrarian views. Some evaluators may even intentionally provide provocative or disruptive inputs to undermine the learning process rather than contributing constructively.

Final Words

LLMs can inherently produce grammatically correct and structurally coherent output, but training them to generate contextually relevant and factually correct results is a challenging task. Reinforcement learning from human feedback (RLHF) fine-tunes these models to generate content that is truthful, helpful, and contextually relevant by incorporating feedback from human evaluators. It bridges the gap between linguistic fluency and context-specific relevance, leading to more efficient and reliable language models.
