
RS-DPO: A Hybrid Rejection Sampling and Direct Preference Optimization Method for Alignment of Large Language Models



"Large Language Models"arXiv:2402.10038v1 Announce Type: cross Abstract: Reinforcement learning from human feedback (RLHF) has been extensively employed to align large language models with user intent.



