Online Rubrics Elicitation from Pairwise Comparisons

BibTeX
@misc{rezaei2025onlinerubricselicitation,
      title={Online Rubrics Elicitation from Pairwise Comparisons},
      author={MohammadHossein Rezaei and Robert Vacareanu and Zihao Wang and Clinton Wang and Yunzhong He and Afra Feyza Akyürek},
      year={2025},
      eprint={2510.07284},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2510.07284},
}
AI Audio Lecture + Q&A
Transcript
Speaker 1: So, we've all been deeply immersed in the world of large language models and their alignment, especially through methods like Reinforcement Learning from Human Feedback, or RLHF. But there's a recurring pain point, isn't there? The static nature of evaluation rubrics. This Scale AI paper, 'Online Rubrics Elicitation from Pairwise Comparisons,' really dives into solving that by making our evaluation criteria dynamic. It's a significant step beyond simply having human-written or even synthetically generated fixed rubrics.

Speaker 2: Absolutely. The reward-hacking problem, where models game the system by satisfying explicit criteria without genuinely producing high-quality output, is something we frequently encounter. So this dynamic approach sounds like it's tackling a core limitation of current RLHF paradigms.

Speaker 1: Exactly. The core problem they identify is that static rubrics, whether human-authored or synthetically generated, are inherently incomplete. They can't possibly foresee all the emergent behaviors, both good and bad, that an LLM will exhibit during training. This leads to two main issues: reward hacking, where the model satisfies the letter of the law but not the spirit, and missing emergent desiderata, meaning new, desirable qualities aren't rewarded and novel errors aren't penalized. Think of it like a game where the rules are fixed at the start, but players evolve new strategies the designers never anticipated. The LLM becomes a master player of the *current* rules, not necessarily the *ideal* game. Their solution, OnlineRubrics, is to dynamically augment these rubrics. Instead of trying to write a perfect rubric upfront, they introduce an LLM-based 'extractor' that, at each training step, looks at pairs of responses, one from the current policy and one from a control policy, and identifies *meaningful differences*. Crucially, these differences are then transformed into *new, weighted, binary-checkable evaluation criteria* that are added to the existing rubric. This allows the reward signal to evolve with the model's capabilities, constantly refining what 'good' means. It's like having a dynamic rulebook that expands and adapts as the game is played.

Speaker 2: That's a powerful shift. So the key is the online, adaptive nature and the use of pairwise comparisons to identify these new criteria. Why is comparing two responses so critical here? Why not just evaluate a single response against the existing criteria and then try to elicit new ones from that single output?

Speaker 1: That's a great question, and it's one of their core insights. The paper shows that pairwise comparison is significantly more effective than pointwise elicitation. An LLM acting as an 'extractor' is far better at identifying *discriminative properties*, what makes one response better or worse than another, when it has two concrete examples to compare. If you just give it one response, it might generate generic desiderata based on its internal knowledge. But by contrasting a potentially better output from the current policy with a potentially worse one from the control policy, the extractor can pinpoint very specific, sample-grounded differences that lead to new, highly relevant criteria. It's like a comparative literary critic analyzing two texts to highlight their unique merits and flaws, rather than critiquing one in isolation. Technically, they integrate this into a GRPO, or Group Relative Policy Optimization, algorithm.
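A minimal sketch of what the pairwise elicitation step described above might look like in code, assuming an OpenAI-style chat-completions client. The prompt wording, the "gpt-4o" model name, and the Criterion schema are illustrative assumptions for this sketch, not the paper's actual extractor.

```python
# Sketch of a pairwise rubric-elicitation call (illustrative; not the paper's exact
# prompt, model, or schema). Assumes an OpenAI-style chat-completions client.
import json
from dataclasses import dataclass
from openai import OpenAI

client = OpenAI()  # hypothetical setup; any chat-completion client would do


@dataclass
class Criterion:
    text: str      # binary-checkable statement grounded in one of the responses
    weight: float  # positive = desirable behavior, negative = penalized behavior


EXTRACTOR_PROMPT = """You are comparing two responses to the same prompt.
Identify meaningful quality differences between Response A and Response B.
For each difference, write a new evaluation criterion that is:
- grounded in something that actually appears in one of the two responses,
- checkable with a yes/no judgment,
- assigned a weight (positive for desirable, negative for undesirable).
Return a JSON object of the form {"criteria": [{"text": "...", "weight": 1.0}, ...]}."""


def elicit_criteria(prompt: str, response_a: str, response_b: str) -> list[Criterion]:
    """Ask an extractor LLM for new criteria that discriminate between two responses."""
    completion = client.chat.completions.create(
        model="gpt-4o",  # assumption: any capable extractor model
        messages=[
            {"role": "system", "content": EXTRACTOR_PROMPT},
            {"role": "user", "content": f"Prompt:\n{prompt}\n\n"
                                        f"Response A:\n{response_a}\n\n"
                                        f"Response B:\n{response_b}"},
        ],
        response_format={"type": "json_object"},
    )
    items = json.loads(completion.choices[0].message.content).get("criteria", [])
    return [Criterion(text=c["text"], weight=float(c["weight"])) for c in items]
```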
Speaker 1: An LLM grader then evaluates each output against the *augmented* rubric to compute rewards. The elicited criteria themselves must be grounded in one of the responses, which prevents hallucinated or external-knowledge-based criteria. After elicitation, a deduplication step merges similar criteria to keep the rubric efficient. This continuous feedback loop means the LLM isn't just learning to satisfy criteria; it's actively helping to *define* the criteria for its own improvement. That's one of the crucial insights: the LLM itself is creating the richer, more specific reward function.

Speaker 2: I see. So instead of just saying 'this response is bad,' it's more like 'this response is bad because it lacks X, which the other response has,' and then X becomes a new criterion. That makes a lot of sense for generating actionable feedback. And it sounds like the performance gains were quite substantial, even on unseen human-written rubrics, suggesting the model internalizes the underlying desired behaviors better.

Speaker 1: Exactly. The results were quite compelling. They showed consistent improvements across generalist and expert tasks, outperforming static rubrics and other baselines like fixed universal requirements or pointwise elicitation. The qualitative analysis of the elicited criteria was also fascinating, revealing recurring themes like an emphasis on reproducibility, practical feasibility, and explicit anti-gaming principles, pushing the models toward more robust and responsible behaviors. This research really shifts the paradigm of reward engineering. It moves us away from the manual, often incomplete process of defining every evaluation point upfront, toward an adaptive, semi-autonomous system. It also strengthens the 'LLM-as-a-Judge' concept, showing that LLMs aren't just good at evaluating but also at *constructing* the very frameworks for evaluation. This opens up avenues for more sophisticated forms of LLM self-improvement, where models could refine their own learning objectives and evaluative standards over time.

Speaker 2: So, essentially, it's making LLMs more proactive in their own training, leading to models that are not only more aligned but also more robust against unforeseen challenges. That's a powerful implication for the future of AI development. It hints at a recursive self-improvement loop for alignment.

Speaker 1: Precisely. The main takeaway for me is that adaptive evaluation is crucial for training truly aligned and robust LLMs on complex, open-ended tasks. By enabling rubrics to evolve dynamically with the model's capabilities, we're building systems that are less prone to reward hacking and more capable of exhibiting genuinely desirable and nuanced behaviors. It's about moving from a fixed target to a moving, self-adjusting goalpost, ensuring the model is always aiming for the 'best' version of itself.

Speaker 2: Right, an ever-improving reward function for an ever-improving model. It's a fundamental step towards more intelligent and adaptable AI systems.
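To pull the pieces the transcript describes into one place (pairwise elicitation, deduplication, grading against the augmented rubric, and a GRPO-style group-relative reward), here is a rough skeleton of a single training step. Everything here is an assumption-laden sketch: the policy, control_policy, and judge interfaces are hypothetical stand-ins, the policy-gradient update itself is omitted, and the functions reuse Criterion and elicit_criteria from the previous sketch.

```python
# Illustrative skeleton of an OnlineRubrics-style training step (hypothetical interfaces,
# not the paper's code). Reuses Criterion and elicit_criteria from the sketch above.

def grade(prompt: str, response: str, rubric: list[Criterion], judge) -> float:
    """Score a response as the weighted sum of binary criterion checks by an LLM judge."""
    return sum(c.weight * judge.check(prompt, response, c.text) for c in rubric)


def deduplicate(rubric: list[Criterion], new: list[Criterion], judge) -> list[Criterion]:
    """Merge newly elicited criteria into the rubric, dropping near-duplicates."""
    for c in new:
        if not any(judge.same_criterion(c.text, old.text) for old in rubric):
            rubric.append(c)
    return rubric


def online_rubrics_step(prompt, rubric, policy, control_policy, judge, group_size=8):
    # 1) Sample a group of rollouts from the current policy (GRPO-style group).
    group = [policy.generate(prompt) for _ in range(group_size)]
    # 2) Sample a reference response from the control policy.
    control = control_policy.generate(prompt)
    # 3) Pairwise elicitation: compare a current-policy response with the control
    #    response and turn meaningful differences into new weighted criteria.
    new_criteria = elicit_criteria(prompt, group[0], control)
    # 4) Deduplicate against the existing rubric so the criterion list stays compact.
    rubric = deduplicate(rubric, new_criteria, judge)
    # 5) Grade every rollout against the *augmented* rubric.
    rewards = [grade(prompt, r, rubric, judge) for r in group]
    # 6) GRPO-style group-relative advantages: normalize rewards within the group.
    mean_r = sum(rewards) / len(rewards)
    std_r = (sum((r - mean_r) ** 2 for r in rewards) / len(rewards)) ** 0.5 or 1.0
    advantages = [(r - mean_r) / std_r for r in rewards]
    return rubric, group, advantages  # advantages feed the policy-gradient update
```

The design choice this sketch tries to reflect is that the rubric is mutable state threaded through training: each step can only add grounded, deduplicated criteria, so the reward function grows more specific as the policy improves rather than staying fixed.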