According to Nature, artificial intelligence models can perform as well as humans on law exams, answering multiple-choice, short-answer and essay questions at a level that matches human test-takers. Yet the same systems struggle with real-world legal tasks: some lawyers have been fined for filing AI-generated court briefs that misrepresented legal principles and cited non-existent cases. The same pattern appears in finance, where AI can pass the Chartered Financial Analyst exam yet performs poorly on basic tasks expected of entry-level analysts. This failure of exams as proxies for real competence has led researchers to propose a new evaluation approach in which specialists engage AI in extended conversations to determine whether systems genuinely understand a domain rather than merely imitating understanding, potentially through what they call a “Sunstein test” involving legal experts such as Harvard’s Cass Sunstein.
The Fundamental Flaw in Current AI Assessment
The core issue reflects a pattern I’ve observed across multiple AI domains: we’re measuring the wrong things. Benchmarks modeled on standardized tests work well for structured knowledge recall but fail to capture the nuanced reasoning that professional practice demands. When AI systems can ace law exams yet produce legally defective briefs that lead to real-world sanctions, we’re witnessing what educational theorists call “teaching to the test” on a massive scale. The systems have learned to optimize for specific evaluation metrics rather than developing genuine domain mastery.
Why Expert Interviews Could Transform AI Development
The proposed expert interview approach addresses several critical gaps in current AI evaluation. Unlike standardized tests with predetermined answers, extended conversations with domain specialists like legal scholars or financial experts would test AI’s ability to handle ambiguity, respond to unexpected questions, and demonstrate coherent reasoning across diverse contexts. This method would push beyond the current limitations of AI systems that often excel at pattern recognition but struggle with adaptive reasoning. The key innovation is treating AI assessment more like professional certification than academic examination.
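To make the contrast with fixed-answer testing concrete, here is a minimal sketch of how an interview-style evaluation might be recorded and scored: follow-up questions depend on prior answers, and the expert rates each turn against a qualitative rubric rather than an answer key. The rubric dimensions, data structures, and sample question below are illustrative assumptions on my part, not details from the Nature proposal.

```python
from dataclasses import dataclass, field

@dataclass
class InterviewTurn:
    question: str  # posed by the human expert
    answer: str    # produced by the AI system under evaluation
    scores: dict = field(default_factory=dict)  # rubric dimension -> rating on a 0-5 scale

@dataclass
class ExpertInterview:
    """One extended conversation between a domain expert and an AI system.

    Unlike a fixed-answer benchmark, each follow-up question can depend on the
    previous answer, and scoring uses a qualitative rubric rather than a key.
    """
    expert_id: str
    domain: str
    turns: list = field(default_factory=list)

    def add_turn(self, question: str, answer: str, scores: dict) -> None:
        self.turns.append(InterviewTurn(question, answer, scores))

    def dimension_average(self, dimension: str) -> float:
        ratings = [t.scores[dimension] for t in self.turns if dimension in t.scores]
        return sum(ratings) / len(ratings) if ratings else float("nan")

# Hypothetical usage: a legal scholar probes reasoning rather than recall.
interview = ExpertInterview(expert_id="reviewer-01", domain="contract law")
interview.add_turn(
    question="How would your analysis change if the governing statute were ambiguous on this point?",
    answer="<model response>",
    scores={"coherence": 4, "handles_ambiguity": 2, "cites_real_authority": 3},
)
print(interview.dimension_average("handles_ambiguity"))
```

The point of the structure is that the signal comes from human judgments accumulated across a conversation, not from matching a predetermined key.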
The Practical Hurdles of Expert-Based Evaluation
While conceptually promising, this approach faces significant implementation challenges. Scaling expert interviews requires substantial resources and raises questions about consistency across evaluators, and there is a risk of building evaluation systems that are too subjective or shaped by individual biases. The financial services example is particularly telling: an AI that passes the CFA exam still struggles with basic analyst tasks, and designing standardized expert evaluations across different financial domains would require careful work to keep assessments fair and comprehensive.
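The consistency worry can at least be quantified. Below is a minimal sketch, assuming each expert assigns a categorical pass/borderline/fail verdict to the same set of interviews, that measures chance-corrected agreement between two raters with Cohen’s kappa; the verdict labels and sample data are hypothetical.

```python
from collections import Counter

def cohens_kappa(ratings_a: list, ratings_b: list) -> float:
    """Chance-corrected agreement between two raters over the same items."""
    assert len(ratings_a) == len(ratings_b)
    n = len(ratings_a)
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    labels = set(freq_a) | set(freq_b)
    expected = sum((freq_a[label] / n) * (freq_b[label] / n) for label in labels)
    return (observed - expected) / (1 - expected) if expected != 1 else 1.0

# Hypothetical verdicts from two legal experts on the same ten interviews.
expert_1 = ["pass", "fail", "pass", "borderline", "pass", "fail", "pass", "pass", "borderline", "fail"]
expert_2 = ["pass", "fail", "borderline", "borderline", "pass", "pass", "pass", "fail", "borderline", "fail"]
print(f"kappa = {cohens_kappa(expert_1, expert_2):.2f}")
```

Low kappa values would indicate that the evaluation is measuring the evaluator as much as the system under test, which is exactly the subjectivity risk described above.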
What This Means for AI Development and Regulation
This shift in evaluation philosophy could fundamentally alter how AI systems are developed and regulated. Instead of optimizing for benchmark performance, developers would need to build systems capable of sustained, coherent reasoning under expert scrutiny. That aligns with growing calls for more rigorous AI validation in high-stakes domains such as healthcare, finance, and law. The proposed evaluation framework, drawing on platforms like research networks and evaluation communities, could establish industry-wide standards that move beyond the current patchwork of automated benchmarks.
The Path Forward for AI Assessment
The transition to expert-based evaluation won’t happen overnight, but the direction is clear. As AI systems take on more complex reasoning tasks, we need assessment methods that can distinguish genuine understanding from sophisticated pattern matching. The most immediate impact will likely come in specialized professional domains where the consequences of AI failure are significant. Over time, this approach could help build public trust by providing more transparent and meaningful evaluations of what AI systems can actually accomplish in real-world scenarios.