DUO: Diverse, Uncertain, On-Policy Query Generation and Selection for Reinforcement Learning from Human Feedback

Xuening Feng, Zhaohui Jiang, Timo Kaufmann, Puchen Xu, Eyke Hüllermeier, Paul Weng, and Yifei Zhu
In Proceedings of the AAAI Conference on Artificial Intelligence, 2025

Abstract

Defining a reward function is usually a challenging but critical task for the system designer in reinforcement learning, especially when specifying complex behaviors. Reinforcement learning from human feedback (RLHF) has emerged as a promising approach to circumvent this difficulty. In RLHF, the agent typically learns a reward function by querying a human teacher with pairwise comparisons of trajectory segments. A key question in this domain is how to reduce the number of queries necessary to learn an informative reward function, since asking a human teacher too many queries is impractical and costly. To tackle this question, we propose DUO, a novel method for diverse, uncertain, on-policy query generation and selection in RLHF. Our method produces queries that are (1) more relevant for policy training (via an on-policy criterion), (2) more informative (via a principled measure of epistemic uncertainty), and (3) diverse (via a clustering-based filter). Experimental results on a variety of locomotion and robotic manipulation tasks demonstrate that our method can outperform state-of-the-art RLHF methods given the same total budget of queries, while being robust to possibly irrational teachers.
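To give a concrete feel for how the three criteria in the abstract could combine in a query-selection step, here is a minimal Python sketch. It is not the authors' implementation: it assumes epistemic uncertainty is approximated by the disagreement of an ensemble of reward models over Bradley-Terry preference probabilities, that an on-policy relevance weight is supplied externally via a hypothetical `policy_score` callable, and that diversity is enforced by k-means clustering over simple segment features.

```python
# Hypothetical sketch of a DUO-style query selection step (not the paper's code).
import numpy as np
from sklearn.cluster import KMeans

def select_queries(seg_pairs, reward_ensemble, policy_score, n_queries):
    """seg_pairs: list of (seg_a, seg_b), each segment an array of shape (T, obs_dim).
    reward_ensemble: list of callables mapping a segment to a scalar return estimate.
    policy_score: callable (seg_a, seg_b) -> on-policy relevance weight (assumption).
    Returns indices of the selected pairs."""
    # Epistemic uncertainty proxy: std over ensemble members of
    # P(seg_a preferred over seg_b) under a Bradley-Terry model on predicted returns.
    prefs = np.array([
        [1.0 / (1.0 + np.exp(-(r(a) - r(b)))) for (a, b) in seg_pairs]
        for r in reward_ensemble
    ])                              # shape: (n_models, n_pairs)
    uncertainty = prefs.std(axis=0)

    # On-policy relevance: up-weight pairs whose segments resemble current policy behavior.
    relevance = np.array([policy_score(a, b) for (a, b) in seg_pairs])
    scores = uncertainty * relevance

    # Diversity filter: cluster pairs by a simple feature (mean observation of both
    # segments) and keep the highest-scoring pair from each cluster.
    feats = np.array([np.concatenate([a.mean(0), b.mean(0)]) for (a, b) in seg_pairs])
    labels = KMeans(n_clusters=n_queries, n_init=10, random_state=0).fit_predict(feats)
    selected = []
    for c in range(n_queries):
        members = np.where(labels == c)[0]
        if members.size:
            selected.append(int(members[np.argmax(scores[members])]))
    return selected
```

The specific uncertainty estimate, relevance weighting, and clustering features above are illustrative choices; the paper should be consulted for the actual criteria DUO uses.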

Cite

@inproceedings{feng2025duo,
  slug = {duo},
  author = {Feng, Xuening and Jiang, Zhaohui and Kaufmann, Timo and Xu, Puchen and Hüllermeier, Eyke and Weng, Paul and Zhu, Yifei},
  booktitle = {Proceedings of the AAAI Conference on Artificial Intelligence},
  title = {DUO: Diverse, Uncertain, On-Policy Query Generation and Selection for Reinforcement Learning from Human Feedback},
  year = {2025},
  doi = {10.1609/aaai.v39i16.33824}
}