Relatively Rational: Learning Utilities and Rationalities Jointly from Pairwise Preferences

Taku Yamagata, Tobias Oberkofler, Timo Kaufmann, Viktor Bengs, Eyke Hüllermeier, and Raul Santos-Rodriguez
In ICML 2024 Workshop on Models of Human Feedback for AI Alignment (MHFAIA), 2024

Abstract

Learning utilities from preference feedback has become increasingly important, particularly in fine-tuning language models such as ChatGPT. Traditional methods often assume equal rationality among labellers, leading to inaccurate utility estimates. We propose an algorithm that jointly estimates trainer rationality and item utilities to enhance utility learning and gain additional insights from feedback. Our approach focuses on settings where feedback is received from multiple trainers, using the Boltzmann-rational model to relate choices to latent utilities while accounting for varying levels of rationality. Given shared utilities, our method identifies rationality ratios among trainers from observed choices without extra calibration data or assumptions. We analyse the theoretical impact of assuming equal rationality on utility accuracy and empirically show superior performance in an action-advice setting, where agents construct policies using the learned utilities as rewards. By accurately modelling trainer rationality, we can enhance high-quality feedback collection, potentially leading to better-aligned models and an improved understanding of human preferences.
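
To make the setup concrete, below is a minimal sketch of joint maximum-likelihood estimation of item utilities and per-trainer rationality under the standard Boltzmann-rational (Bradley-Terry-style) likelihood that the abstract refers to. It is an illustration of the general idea, not the paper's implementation; the function names (neg_log_likelihood, fit) and the toy data are hypothetical, and the identifiability handling (pinning one trainer's rationality to 1 so that only ratios are recovered) is one common choice consistent with the "rationality ratios" framing above.

  import numpy as np
  from scipy.optimize import minimize

  # Illustrative sketch only (not the paper's code): joint maximum-likelihood
  # estimation of item utilities u and per-trainer rationality beta under a
  # Boltzmann-rational (Bradley-Terry-style) choice model,
  #   P(i preferred over j | trainer k) = sigmoid(beta_k * (u_i - u_j)).
  # Scaling all utilities up and all betas down leaves the likelihood
  # unchanged, so only rationality *ratios* are identifiable; the first
  # trainer's beta is pinned to 1 to fix the scale.

  def neg_log_likelihood(params, prefs, n_items):
      """prefs: iterable of (winner, loser, trainer) index triples."""
      u = params[:n_items]
      beta = np.concatenate(([1.0], np.exp(params[n_items:])))  # exp keeps beta > 0
      nll = 0.0
      for winner, loser, k in prefs:
          # -log sigmoid(x) = log(1 + exp(-x)), computed stably via logaddexp.
          nll += np.logaddexp(0.0, -beta[k] * (u[winner] - u[loser]))
      return nll

  def fit(prefs, n_items, n_trainers, seed=0):
      rng = np.random.default_rng(seed)
      x0 = rng.normal(scale=0.1, size=n_items + n_trainers - 1)
      res = minimize(neg_log_likelihood, x0, args=(prefs, n_items), method="L-BFGS-B")
      u_hat = res.x[:n_items]
      beta_hat = np.concatenate(([1.0], np.exp(res.x[n_items:])))
      return u_hat - u_hat.mean(), beta_hat  # centre utilities for readability

  # Toy usage: two trainers with different rationality judging three items.
  if __name__ == "__main__":
      true_u, true_beta = np.array([0.0, 1.0, 2.0]), np.array([1.0, 4.0])
      rng = np.random.default_rng(1)
      prefs = []
      for _ in range(2000):
          i, j = rng.choice(3, size=2, replace=False)
          k = rng.integers(2)
          p_i = 1.0 / (1.0 + np.exp(-true_beta[k] * (true_u[i] - true_u[j])))
          prefs.append((i, j, k) if rng.random() < p_i else (j, i, k))
      u_hat, beta_hat = fit(prefs, n_items=3, n_trainers=2)
      print("estimated utilities:", u_hat)
      print("estimated rationality ratio (trainer 2 / trainer 1):", beta_hat[1] / beta_hat[0])

With enough preferences, the recovered rationality ratio should approach the true ratio of 4, illustrating how shared utilities let relative rationalities be read off from choices alone, without extra calibration data.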

Cite

@inproceedings{yamagata2024relatively,
  slug = {relatively-rational},
  title = {Relatively {{Rational}}: {{Learning Utilities}} and {{Rationalities Jointly}} from {{Pairwise Preferences}}},
  shorttitle = {Relatively {{Rational}}},
  booktitle = {{{ICML}} 2024 {{Workshop}} on {{Models}} of {{Human Feedback}} for {{AI Alignment}} ({{MHFAIA}})},
  author = {Yamagata, Taku and Oberkofler, Tobias and Kaufmann, Timo and Bengs, Viktor and H{\"u}llermeier, Eyke and {Santos-Rodriguez}, Raul},
  year = {2024}
}