RLHF (reinforcement learning from human feedback) is one of the main methods used to make AI assistants helpful, honest, and harmless. After initial pretraining, human raters compare several of the model's responses to the same prompt and indicate which they prefer. Those comparisons are used to train a reward model that predicts human preferences, and the AI is then fine-tuned with reinforcement learning to produce responses the reward model scores highly. This process accounts for much of the difference between a raw pretrained model, which can be unhelpful or toxic, and a polite, useful assistant like Claude or ChatGPT.
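To make the comparison step concrete, here is a minimal sketch of how pairwise preferences become a training signal. It uses the standard Bradley-Terry formulation, where the loss is low when the chosen response outscores the rejected one. The single-weight "reward model" and the `politeness` feature are hypothetical stand-ins for a real neural network scoring full responses.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def preference_loss(r_chosen, r_rejected):
    """Bradley-Terry loss: -log sigmoid(margin); small when the
    chosen response's reward exceeds the rejected one's."""
    return -math.log(sigmoid(r_chosen - r_rejected))

# Toy reward model: a single weight w scoring a 1-D response feature
# (a hypothetical "politeness" score; real models score whole texts).
w = 0.0
chosen_feat, rejected_feat = 1.0, -1.0  # rater preferred the polite reply
lr = 0.1
for _ in range(100):
    margin = w * chosen_feat - w * rejected_feat
    # Gradient of -log sigmoid(margin) with respect to w
    grad = -(1.0 - sigmoid(margin)) * (chosen_feat - rejected_feat)
    w -= lr * grad

# After training, the reward model scores the preferred style higher,
# so an RL fine-tuning stage can optimize against it.
assert w * chosen_feat > w * rejected_feat
```

In a full RLHF pipeline this reward model would then be frozen and used as the objective for a reinforcement-learning stage (commonly PPO) that adjusts the assistant itself.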