
RS-DPO: A Hybrid Rejection Sampling and Direct Preference Optimization Method for Alignment of Large Language Models



"Large Language Models"arXiv:2402.10038v1 Announce Type: cross Abstract: Reinforcement learning from human feedback (RLHF) has been extensively employed to align large language models with user intent.



