2025
NeurIPS
ResponseRank: Data-Efficient Reward Modeling through Preference Strength Learning
@inproceedings{kaufmann2025responserank,author={Kaufmann, Timo and Metz, Yannick and Keim, Daniel and H{\"u}llermeier, Eyke},booktitle={Advances in Neural Information Processing Systems ({{NeurIPS}}), to appear},title={ResponseRank: Data-Efficient Reward Modeling through Preference Strength Learning},year={2025}}
TMLR
A Survey of Reinforcement Learning from Human Feedback
Reinforcement learning from human feedback (RLHF) is a variant of reinforcement learning (RL) that learns from human feedback instead of relying on an engineered reward function. Building on prior work on the related setting of preference-based reinforcement learning (PbRL), it stands at the intersection of artificial intelligence and human-computer interaction. This positioning provides a promising approach to enhance the performance and adaptability of intelligent systems while also improving the alignment of their objectives with human values. The success in training large language models (LLMs) has impressively demonstrated this potential in recent years, where RLHF has played a decisive role in directing the model’s capabilities towards human objectives. This article provides an overview of the fundamentals of RLHF, exploring how RL agents interact with human feedback. While recent focus has been on RLHF for LLMs, our survey covers the technique across multiple domains. We provide our most comprehensive coverage in control and robotics, where many fundamental techniques originate, alongside a dedicated LLM section. We examine the core principles that underpin RLHF, how algorithms and human feedback work together, and discuss the main research trends in the field. Our goal is to give researchers and practitioners a clear understanding of this rapidly growing field.
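To make the core mechanism concrete, here is a minimal sketch of the reward-learning step that RLHF builds on: fitting latent utilities to pairwise preferences under a Bradley-Terry model. The synthetic items, data, and plain gradient ascent are illustrative assumptions, not taken from the survey.

```python
# Minimal sketch: learn latent utilities from pairwise preferences (Bradley-Terry).
import numpy as np

rng = np.random.default_rng(0)
n_items, n_prefs = 5, 400
true_u = rng.normal(size=n_items)               # hidden ground-truth utilities

# Simulate pairwise preferences: the "winner" is drawn with Bradley-Terry probability.
pairs = rng.integers(0, n_items, size=(n_prefs, 2))
pairs = pairs[pairs[:, 0] != pairs[:, 1]]
p_first = 1.0 / (1.0 + np.exp(-(true_u[pairs[:, 0]] - true_u[pairs[:, 1]])))
first_wins = rng.random(len(pairs)) < p_first
winners = np.where(first_wins, pairs[:, 0], pairs[:, 1])
losers = np.where(first_wins, pairs[:, 1], pairs[:, 0])

# Maximum-likelihood fit of the utilities by gradient ascent on the log-likelihood.
u = np.zeros(n_items)
lr = 0.05
for _ in range(2000):
    p_win = 1.0 / (1.0 + np.exp(-(u[winners] - u[losers])))
    grad = np.zeros(n_items)
    np.add.at(grad, winners, 1.0 - p_win)
    np.add.at(grad, losers, -(1.0 - p_win))
    u += lr * grad / len(winners)
u -= u.mean()                                   # utilities are identifiable only up to a constant

print(np.corrcoef(true_u - true_u.mean(), u)[0, 1])   # correlation with ground truth, close to 1
```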
@article{kaufmann2025survey,title={A {{Survey}} of {{Reinforcement Learning}} from {{Human Feedback}}},author={Kaufmann, Timo and Weng, Paul and Bengs, Viktor and H{\"u}llermeier, Eyke},year={2025},journal={Transactions on Machine Learning Research},issn={2835-8856},}
arXiv
Feedback Forensics: A Toolkit to Measure AI Personality
Some traits making a "good" AI model are hard to describe upfront. For example, should responses be more polite or more casual? Such traits are sometimes summarized as model character or personality. Without a clear objective, conventional benchmarks based on automatic validation struggle to measure such traits. Evaluation methods using human feedback, such as Chatbot Arena, have emerged as a popular alternative. These methods infer "better" personality and other desirable traits implicitly by ranking multiple model responses relative to each other. Recent issues with model releases highlight the limitations of these opaque evaluation approaches: a major model was rolled back over sycophantic personality issues, and models have been observed overfitting to such feedback-based leaderboards. Despite these known issues, limited public tooling exists to explicitly evaluate model personality. We introduce Feedback Forensics: an open-source toolkit to track AI personality changes, both those encouraged by human (or AI) feedback and those exhibited across AI models trained and evaluated on such feedback. Leveraging AI annotators, our toolkit enables investigating personality via a Python API and browser app. We demonstrate the toolkit’s usefulness in two steps: (A) we first analyse the personality traits encouraged in popular human feedback datasets, including Chatbot Arena, MultiPref and PRISM; and (B) we then use our toolkit to analyse how much popular models exhibit such traits. We release (1) our Feedback Forensics toolkit alongside (2) a web app tracking AI personality in popular models and feedback datasets, as well as (3) the underlying annotation data at https://github.com/rdnfn/feedback-forensics.
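As a conceptual illustration only (this is not the Feedback Forensics API), the sketch below shows one way a trait's "encouragement" by pairwise feedback could be quantified: the share of trait-discriminating comparisons won by responses exhibiting the trait. The Comparison dataclass and its fields are hypothetical stand-ins for per-response AI-annotator judgments.

```python
# Illustrative sketch (not the Feedback Forensics API): estimate how strongly a
# personality trait is favoured by pairwise feedback data.
from dataclasses import dataclass

@dataclass
class Comparison:
    chosen_has_trait: bool    # e.g. an AI annotator judged the chosen response "casual"
    rejected_has_trait: bool  # the same annotation for the rejected response

def trait_encouragement(comparisons: list[Comparison]) -> float:
    """Share of trait-discriminating comparisons won by the trait, in [0, 1].
    A value of 0.5 means the feedback is indifferent to the trait."""
    wins = sum(c.chosen_has_trait and not c.rejected_has_trait for c in comparisons)
    losses = sum(c.rejected_has_trait and not c.chosen_has_trait for c in comparisons)
    return 0.5 if wins + losses == 0 else wins / (wins + losses)

data = [Comparison(True, False), Comparison(True, False), Comparison(False, True)]
print(trait_encouragement(data))  # ~0.67: this feedback mildly favours the trait
```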
@article{findeis2025feedback,title={Feedback {{Forensics}}: {{A Toolkit}} to {{Measure AI Personality}}},author={Findeis, Arduin and Kaufmann, Timo and H{\"u}llermeier, Eyke and Mullins, Robert},year={2025}}
ICLR
Inverse Constitutional AI: Compressing Preferences into Principles
Feedback data plays an important role in fine-tuning and evaluating state-of-the-art AI models. Often pairwise text preferences are used: given two texts, human (or AI) annotators select the "better" one. Such feedback data is widely used to align models to human preferences (e.g., reinforcement learning from human feedback), or to rank models according to human preferences (e.g., Chatbot Arena). Despite its widespread use, prior work has demonstrated that human-annotated pairwise text preference data often exhibits unintended biases. For example, human annotators have been shown to prefer assertive over truthful texts in certain contexts. Models trained or evaluated on this data may implicitly encode these biases in a manner hard to identify. In this paper, we formulate the interpretation of existing pairwise text preference data as a compression task: the Inverse Constitutional AI (ICAI) problem. In constitutional AI, a set of principles (or constitution) is used to provide feedback and fine-tune AI models. The ICAI problem inverts this process: given a dataset of feedback, we aim to extract a constitution that best enables a large language model (LLM) to reconstruct the original annotations. We propose a corresponding initial ICAI algorithm and validate its generated constitutions quantitatively based on reconstructed annotations. Generated constitutions have many potential use cases – they may help identify undesirable biases, scale feedback to unseen data or assist with adapting LLMs to individual user preferences. We demonstrate our approach on a variety of datasets: (a) synthetic feedback datasets with known underlying principles; (b) the AlpacaEval dataset of cross-annotated human feedback; and (c) the crowdsourced Chatbot Arena dataset. We release the code for our algorithm and experiments at https://github.com/rdnfn/icai.
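The ICAI objective can be summarized as a scoring problem, sketched below under assumptions: the annotate callable is a stand-in for an LLM prompted with the candidate principles, and best_constitution is a hypothetical helper for selecting among candidate constitutions, not the released algorithm.

```python
# Sketch of the ICAI objective: score a constitution by how well it lets an
# annotator reconstruct the original pairwise labels.
from typing import Callable, Sequence

Pair = tuple[str, str]                                   # (response_a, response_b)
Annotator = Callable[[Sequence[str], Pair], int]         # returns 0 if response_a preferred, else 1

def reconstruction_accuracy(constitution: Sequence[str],
                            pairs: Sequence[Pair],
                            labels: Sequence[int],
                            annotate: Annotator) -> float:
    """Fraction of the original annotations the constitution lets the annotator recover."""
    hits = sum(annotate(constitution, pair) == label for pair, label in zip(pairs, labels))
    return hits / len(labels)

def best_constitution(candidates: Sequence[Sequence[str]],
                      pairs: Sequence[Pair],
                      labels: Sequence[int],
                      annotate: Annotator) -> Sequence[str]:
    """Pick the candidate set of principles that best reconstructs the feedback."""
    return max(candidates, key=lambda c: reconstruction_accuracy(c, pairs, labels, annotate))
```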
@inproceedings{findeis2025inverse,title={Inverse {{Constitutional AI}}: {{Compressing Preferences}} into {{Principles}}},booktitle={Proceedings of the International Conference on Learning Representations ({{ICLR}})},shorttitle={Inverse {{Constitutional AI}}},author={Findeis, Arduin and Kaufmann, Timo and H{\"u}llermeier, Eyke and Albanie, Samuel and Mullins, Robert},year={2025}}
ICML
Comparing Comparisons: Informative and Easy Human Feedback with Distinguishability Queries
Xuening Feng, Zhaohui Jiang, Timo Kaufmann, and 3 more authors
In Proceedings of the International Conference on Machine Learning (ICML), 2025
Learning human objectives from preference feedback has significantly advanced reinforcement learning (RL) in domains where objectives are hard to formalize. However, traditional methods based on pairwise trajectory comparisons face notable challenges, including the difficulty of comparing trajectories with subtle differences and the fact that such comparisons convey only ordinal information, limiting direct inference of preference strength. In this paper, we introduce a novel distinguishability query, enabling humans to express preference strength by comparing two pairs of trajectories. Labelers first indicate which of two pairs is easier to distinguish, then provide preference feedback only on the easier pair. Our proposed query type directly captures preference strength and is expected to reduce the cognitive load on the labeler. We further connect this query to cardinal utility and difference relations and develop an efficient query selection scheme to achieve a better trade-off between query informativeness and easiness. Experimental results demonstrate the potential of our method for faster, data-efficient learning and improved user-friendliness in RLHF benchmarks, particularly in classical control settings where preference strength is critical for expected utility maximization.
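One way to formalize the idea is sketched below, assuming a Boltzmann-rational labeler model (the paper's exact model may differ): the answer to a distinguishability query depends on the relative size of utility gaps, so it carries cardinal information that an ordinary pairwise comparison cannot.

```python
# Sketch of a distinguishability query under an assumed Boltzmann-rational labeler.
import math

def p_easier(u_a: float, u_b: float, u_c: float, u_d: float, beta: float = 1.0) -> float:
    """Probability that pair (a, b) is reported as easier to distinguish than pair (c, d):
    larger absolute utility gaps are assumed to be easier to compare."""
    gap_ab, gap_cd = abs(u_a - u_b), abs(u_c - u_d)
    return 1.0 / (1.0 + math.exp(-beta * (gap_ab - gap_cd)))

def p_prefer(u_x: float, u_y: float, beta: float = 1.0) -> float:
    """Standard Boltzmann-rational preference probability on the easier pair."""
    return 1.0 / (1.0 + math.exp(-beta * (u_x - u_y)))

# The joint answer (which pair is easier, plus a preference on it) is informative about
# the size of utility gaps, not only their sign; this is the cardinal signal that
# ordinary pairwise comparisons cannot convey.
print(p_easier(2.0, 0.0, 1.1, 1.0), p_prefer(2.0, 0.0))
```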
@inproceedings{feng2025comparing,title={Comparing {{Comparisons}}: {{Informative}} and {{Easy Human Feedback}} with {{Distinguishability Queries}}},shorttitle={Comparing {{Comparisons}}},booktitle={Proceedings of the {{International Conference}} on {{Machine Learning}} ({{ICML}})},author={Feng, Xuening and Jiang, Zhaohui and Kaufmann, Timo and H{\"u}llermeier, Eyke and Weng, Paul and Zhu, Yifei},year={2025}}
AAAI
DUO: Diverse, Uncertain, On-Policy Query Generation and Selection for Reinforcement Learning from Human Feedback
Xuening Feng, Zhaohui Jiang, Timo Kaufmann, and 4 more authors
In Proceedings of the AAAI Conference on Artificial Intelligence, 2025
Defining a reward function is usually a challenging but critical task for the system designer in reinforcement learning, especially when specifying complex behaviors. Reinforcement learning from human feedback (RLHF) emerges as a promising approach to circumvent this. In RLHF, the agent typically learns a reward function by querying a human teacher using pairwise comparisons of trajectory segments. A key question in this domain is how to reduce the number of queries necessary to learn an informative reward function since asking a human teacher too many queries is impractical and costly. To tackle this question, we propose DUO, a novel method for diverse, uncertain, on-policy query generation and selection in RLHF. Our method produces queries that are (1) more relevant for policy training (via an on-policy criterion), (2) more informative (via a principled measure of epistemic uncertainty), and (3) diverse (via a clustering-based filter). Experimental results on a variety of locomotion and robotic manipulation tasks demonstrate that our method can outperform state-of-the-art RLHF methods given the same total budget of queries, while being robust to possibly irrational teachers.
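A rough sketch of the selection step, under illustrative assumptions: candidate query pairs are assumed to be generated from the current policy upstream (the on-policy criterion), uncertainty is proxied by reward-ensemble disagreement, and diversity is enforced with a k-means filter. Shapes, features, and hyperparameters are placeholders, not the paper's configuration.

```python
# Sketch of uncertain-then-diverse query selection over on-policy candidates.
import numpy as np
from sklearn.cluster import KMeans

def select_queries(features: np.ndarray,        # (n_candidates, d) embeddings of segment pairs
                   ensemble_preds: np.ndarray,  # (n_models, n_candidates) predicted P(left > right)
                   budget: int) -> np.ndarray:
    """Return indices of `budget` candidate queries: uncertain first, then diverse."""
    uncertainty = ensemble_preds.std(axis=0)              # disagreement across the reward ensemble
    shortlist = np.argsort(-uncertainty)[: 5 * budget]    # keep the most uncertain candidates
    km = KMeans(n_clusters=budget, n_init=10, random_state=0).fit(features[shortlist])
    chosen = []
    for c in range(budget):                               # most uncertain member of each cluster
        members = shortlist[km.labels_ == c]
        if members.size:
            chosen.append(members[np.argmax(uncertainty[members])])
    return np.array(chosen)

rng = np.random.default_rng(0)
print(select_queries(rng.normal(size=(200, 8)), rng.random((5, 200)), budget=10))
```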
@inproceedings{feng2025duo,author={Feng, Xuening and Jiang, Zhaohui and Kaufmann, Timo and Xu, Puchen and Hüllermeier, Eyke and Weng, Paul and Zhu, Yifei},booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},title={DUO: Diverse, Uncertain, On-Policy Query Generation and Selection for Reinforcement Learning from Human Feedback},year={2025},doi={10.1609/aaai.v39i16.33824}}
CL
Problem Solving Through Human-AI Preference-Based Cooperation
Subhabrata Dutta, Timo Kaufmann, Goran Glavaš, and 7 more authors
While there is a widespread belief that artificial general intelligence (AGI) – or even superhuman AI – is imminent, complex problems in expert domains are far from being solved. We argue that such problems require human-AI cooperation and that the current state of the art in generative AI is unable to play the role of a reliable partner due to a multitude of shortcomings, including difficulty in keeping track of a complex solution artifact (e.g., a software program), limited support for versatile human preference expression, and a lack of adaptation to human preferences in an interactive setting. To address these challenges, we propose HAI-Co2, a novel human-AI co-construction framework. We take first steps towards a formalization of HAI-Co2 and discuss the difficult open research problems that it faces.
@article{dutta2025problem,title={Problem {{Solving Through Human-AI Preference-Based Cooperation}}},author={Dutta, Subhabrata and Kaufmann, Timo and Glava{\v s}, Goran and Habernal, Ivan and Kersting, Kristian and Kreuter, Frauke and Mezini, Mira and Gurevych, Iryna and H{\"u}llermeier, Eyke and Sch{\"u}tze, Hinrich},year={2025},journal={Computational Linguistics},pages={1--35},doi={10.1162/coli.a.19}}
2024
MHFAIA
Comparing Comparisons: Informative and Easy Human Feedback with Distinguishability Queries
Xuening Feng, Zhaohui Jiang, Timo Kaufmann, and 3 more authors
In ICML 2024 Workshop on Models of Human Feedback for AI Alignment (MHFAIA), 2024
Learning human objectives from preference feedback has significantly advanced reinforcement learning (RL) in domains with hard-to-formalize objectives. Traditional methods with pairwise trajectory comparisons face challenges: trajectories with subtle differences are hard to compare, and comparisons are ordinal, limiting direct inference of preference strength. In this paper, we introduce the distinguishability query, where humans compare two pairs of trajectories, indicate which pair is easier to compare, and then give preference feedback on the easier pair. This type of query directly infers preference strength and is expected to reduce cognitive load on the labeler. We also connect this query to cardinal utility and difference relations, and develop an efficient query selection scheme to achieve a better trade-off between query informativeness and easiness. Experimental results demonstrate the potential of our method for faster, data-efficient learning and improved user-friendliness on RLHF benchmarks.
@inproceedings{feng2024comparing,title={Comparing {{Comparisons}}: {{Informative}} and {{Easy Human Feedback}} with {{Distinguishability Queries}}},shorttitle={Comparing {{Comparisons}}},booktitle={{{ICML}} 2024 {{Workshop}} on {{Models}} of {{Human Feedback}} for {{AI Alignment}} ({{MHFAIA}})},author={Feng, Xuening and Jiang, Zhaohui and Kaufmann, Timo and H{\"u}llermeier, Eyke and Weng, Paul and Zhu, Yifei},year={2024}}
MHFAIA
Relatively Rational: Learning Utilities and Rationalities Jointly from Pairwise Preferences
Learning utilities from preference feedback has become increasingly important, particularly in fine-tuning language models such as ChatGPT. Traditional methods often assume equal rationality among labellers, leading to inaccurate utility estimates. We propose an algorithm that jointly estimates trainer rationality and item utilities to enhance utility learning and gain additional insights from feedback. Our approach focuses on settings where feedback is received from multiple trainers, using the Boltzmann-rational model to relate choices to latent utilities while accounting for varying levels of rationality. Given shared utilities, our method identifies rationality ratios among trainers from observed choices without extra calibration data or assumptions. We analyse the theoretical impact of assuming equal rationality on utility accuracy and empirically show superior performance in an action-advice setting, where agents construct policies using the learned utilities as rewards. By accurately modelling trainer rationality, we can enhance high-quality feedback collection, potentially leading to better-aligned models and an improved understanding of human preferences.
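The underlying choice model can be written as P(i preferred over j by trainer t) = sigmoid(beta_t * (u_i - u_j)), with utilities u and rationalities beta fitted jointly. The toy gradient-ascent estimator below only illustrates this model and the identifiability point (rationality ratios are recoverable, absolute scale is not); it is not the paper's estimator.

```python
# Toy sketch: jointly fit item utilities and per-trainer rationality from pairwise choices.
import numpy as np

rng = np.random.default_rng(1)
n_items, n_trainers, n_obs = 6, 3, 3000
true_u = rng.normal(size=n_items)
true_beta = np.array([0.3, 1.0, 3.0])          # trainers differ in rationality

# Simulate choices: trainer t prefers item i over j with prob. sigmoid(beta_t * (u_i - u_j)).
t = rng.integers(0, n_trainers, n_obs)
i = rng.integers(0, n_items, n_obs)
j = rng.integers(0, n_items, n_obs)
keep = i != j
t, i, j = t[keep], i[keep], j[keep]
p = 1.0 / (1.0 + np.exp(-true_beta[t] * (true_u[i] - true_u[j])))
y = (rng.random(len(p)) < p).astype(float)     # 1 if i was chosen over j

# Joint maximum-likelihood fit of utilities and log-rationalities by gradient ascent.
u = np.zeros(n_items)
log_beta = np.zeros(n_trainers)
lr = 0.05
for _ in range(4000):
    beta = np.exp(log_beta)
    q = 1.0 / (1.0 + np.exp(-beta[t] * (u[i] - u[j])))
    err = y - q                                 # gradient of the Bernoulli log-likelihood w.r.t. the logit
    grad_u = np.zeros(n_items)
    grad_b = np.zeros(n_trainers)
    np.add.at(grad_u, i, err * beta[t])
    np.add.at(grad_u, j, -err * beta[t])
    np.add.at(grad_b, t, err * beta[t] * (u[i] - u[j]))   # chain rule through log_beta
    u += lr * grad_u / len(y)
    log_beta += lr * grad_b / len(y)

# Utilities and rationalities are identified only up to a shared scale, so compare ratios.
print(np.exp(log_beta - log_beta[1]))           # estimated rationality ratios (relative to trainer 1)
print(true_beta / true_beta[1])                 # ground-truth ratios
```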
@inproceedings{yamagata2024relatively,title={Relatively {{Rational}}: {{Learning Utilities}} and {{Rationalities Jointly}} from {{Pairwise Preferences}}},shorttitle={Relatively {{Rational}}},booktitle={{{ICML}} 2024 {{Workshop}} on {{Models}} of {{Human Feedback}} for {{AI Alignment}} ({{MHFAIA}})},author={Yamagata, Taku and Oberkofler, Tobias and Kaufmann, Timo and Bengs, Viktor and H{\"u}llermeier, Eyke and {Santos-Rodriguez}, Raul},year={2024}}
RLBRew
OCALM: Object-Centric Assessment with Language Models
Properly defining a reward signal to efficiently train a reinforcement learning (RL) agent is a challenging task. Designing balanced objective functions from which a desired behavior can emerge requires expert knowledge, especially for complex environments. Learning rewards from human feedback or using large language models (LLMs) to directly provide rewards are promising alternatives, allowing non-experts to specify goals for the agent. However, black-box reward models make it difficult to debug the reward. In this work, we propose Object-Centric Assessment with Language Models (OCALM) to derive inherently interpretable reward functions for RL agents from natural language task descriptions. OCALM uses the extensive world-knowledge of LLMs while leveraging the object-centric nature common to many environments to derive reward functions focused on relational concepts, providing RL agents with the ability to derive policies from task descriptions.
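To illustrate what an inherently interpretable, object-centric reward function looks like, here is a hand-written example of the kind of reward OCALM aims to derive from a natural language task description; the environment, object names, and coefficients are made up for illustration and are not produced by the method.

```python
# Hand-written example of an object-centric, relational reward function.
from dataclasses import dataclass
import math

@dataclass
class Obj:
    name: str
    x: float
    y: float

def distance(a: Obj, b: Obj) -> float:
    return math.hypot(a.x - b.x, a.y - b.y)

def reward(objects: dict[str, Obj]) -> float:
    """Task description: 'move the player close to the key while avoiding the enemy'."""
    player, key, enemy = objects["player"], objects["key"], objects["enemy"]
    return -distance(player, key) + 0.5 * min(distance(player, enemy), 4.0)

state = {"player": Obj("player", 0, 0), "key": Obj("key", 3, 4), "enemy": Obj("enemy", 1, 1)}
print(reward(state))  # about -4.29: every term is readable and can be debugged directly
```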
@inproceedings{kaufmann2024ocalm,title={{{OCALM}}: {{Object-Centric Assessment}} with {{Language Models}}},booktitle={{{RLC}} 2024 {{Workshop}} on {{Reinforcement Learning Beyond Rewards}} ({{RLBRew}})},author={Kaufmann, Timo and Bl{\"u}ml, Jannis and W{\"u}st, Antonia and Delfosse, Quentin and Kersting, Kristian and H{\"u}llermeier, Eyke},year={2024}}
2023
HLDM
On the Challenges and Practices of Reinforcement Learning from Real Human Feedback
Reinforcement learning from human feedback (RLHF) is a variant of reinforcement learning (RL) that does not require an engineered reward function but instead learns from human feedback. Due to its increasing popularity, various authors have studied how to learn an accurate reward model from only a few samples, making optimal use of this feedback. Because of the cost and complexity of user studies, however, this research is often conducted with synthetic human feedback. Such feedback can be generated by evaluating behavior based on ground-truth rewards which are available for some benchmark tasks. While this setting can help evaluate some aspects of RLHF, it differs from practical settings in which synthetic feedback is not available. Working with real human feedback brings additional challenges that cannot be observed with synthetic feedback, including fatigue, inter-rater inconsistencies, delay, misunderstandings, and modality-dependent difficulty. We describe and discuss some of these challenges together with current practices and opportunities for further research in this paper.
@inproceedings{kaufmann2023challenges,title={On the~{{Challenges}} and~{{Practices}} of~{{Reinforcement Learning}} from~{{Real Human Feedback}}},booktitle={Machine {{Learning}} and {{Principles}} and {{Practice}} of {{Knowledge Discovery}} in {{Databases}}},author={Kaufmann, Timo and Ball, Sarah and Beck, Jacob and H{\"u}llermeier, Eyke and Kreuter, Frauke},pages={276--294},editor={Meo, Rosa and Silvestri, Fabrizio},publisher={Springer Nature Switzerland},doi={10.1007/978-3-031-74627-7_21},year={2023}}
ML4CPS
Reinforcement Learning from Human Feedback for Cyber-Physical Systems: On the Potential of Self-Supervised Pretraining
In this paper, we advocate for the potential of reinforcement learning from human feedback (RLHF) with self-supervised pretraining to increase the viability of reinforcement learning (RL) for real-world tasks, especially in the context of cyber-physical systems (CPS). We identify potential benefits of self-supervised pretraining in terms of the query sample complexity, safety, robustness, reward exploration and transfer. We believe that exploiting these benefits, combined with the generally improving sample efficiency of RL, will likely enable RL and RLHF to play an increasing role in CPS in the future.
@inproceedings{kaufmann2023reinforcement,title={Reinforcement {{Learning}} from~{{Human Feedback}} for~{{Cyber-Physical Systems}}: {{On}} the~{{Potential}} of~{{Self-Supervised Pretraining}}},booktitle={Proceedings of the {{International Conference}} on {{Machine Learning}} for {{Cyber-Physical Systems}} ({{ML4CPS}})},author={Kaufmann, Timo and Bengs, Viktor and H{\"u}llermeier, Eyke},year={2023},publisher={Springer Nature Switzerland},doi={10.1007/978-3-031-47062-2_2}}