Large Language Models (LLMs) are incredibly versatile, capable of tasks ranging from drafting emails to assisting in medical diagnoses. However, their broad applicability makes them difficult to evaluate systematically. This challenge has led researchers at MIT to explore a new approach that considers human beliefs about LLM capabilities.
The researchers argue that evaluating an LLM requires understanding how humans form beliefs about its performance. They propose a framework that measures the alignment between an LLM and a human generalization function: a model that captures how people update their beliefs about an LLM's capabilities after interacting with it.
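The paper develops this alignment measure formally; the sketch below is only a hypothetical illustration (the Interaction record and the Brier-style score are assumptions, not the authors' exact metric) of how one might quantify whether human beliefs about an LLM track its actual behavior.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class Interaction:
    """One survey item: a person sees an LLM answer one question,
    then predicts whether it will answer a related question correctly."""
    human_predicted_prob: float  # believed probability the LLM gets the related question right
    llm_correct: bool            # whether the LLM actually got it right

def alignment_score(interactions: list[Interaction]) -> float:
    """One minus the mean Brier score of the human predictions:
    1.0 means human beliefs perfectly track the LLM's real performance."""
    brier = mean(
        (item.human_predicted_prob - float(item.llm_correct)) ** 2
        for item in interactions
    )
    return 1.0 - brier

# Toy usage: beliefs that track outcomes score higher than beliefs that do not.
well_aligned = [Interaction(0.9, True), Interaction(0.2, False)]
misaligned = [Interaction(0.9, False), Interaction(0.2, True)]
print(alignment_score(well_aligned))  # 0.975
print(alignment_score(misaligned))    # 0.275
```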
The Human Factor in AI
The human generalization function is based on the idea that humans generalize from their experiences. If someone observes an LLM accurately answering matrix inversion questions, they might assume it can also handle basic arithmetic. However, LLMs don't always exhibit the same patterns of expertise as humans, leading to misalignments with human expectations.
The MIT researchers conducted a survey to measure how humans generalize about LLM performance. They showed participants questions that an LLM had answered correctly or incorrectly and then asked them to predict whether it would answer related questions correctly. The results revealed that people are significantly worse at predicting an LLM's performance than they are at predicting another person's performance.
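The survey data itself is not reproduced here, but the comparison boils down to checking how often participants' predictions match reality for each kind of subject. The records below are made up purely for illustration.

```python
# Hypothetical survey records: the type of subject being judged ("llm" or "human"),
# the participant's prediction about a related question, and the actual outcome.
records = [
    {"subject": "llm",   "predicted_correct": True,  "actually_correct": False},
    {"subject": "llm",   "predicted_correct": True,  "actually_correct": True},
    {"subject": "human", "predicted_correct": False, "actually_correct": False},
    {"subject": "human", "predicted_correct": True,  "actually_correct": True},
]

def generalization_accuracy(records, subject):
    """Fraction of predictions about this subject type that matched what happened."""
    relevant = [r for r in records if r["subject"] == subject]
    hits = sum(r["predicted_correct"] == r["actually_correct"] for r in relevant)
    return hits / len(relevant)

print("predictions about LLMs:  ", generalization_accuracy(records, "llm"))    # 0.5
print("predictions about humans:", generalization_accuracy(records, "human"))  # 1.0
```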
Misaligned Expectations: A Barrier to LLM Success
The study found that human generalization about LLMs can be misleading. In high-stakes settings, where incorrect responses carry more weight, people over-generalize from a powerful model's successes and rely on it for questions it actually gets wrong; as a result, simpler models sometimes lead to better outcomes than larger, more capable models like GPT-4. Part of the problem is that an LLM's failures are harder to anticipate than a human's: it can stumble on seemingly simple tasks in ways that surprise users.
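A toy cost model (with entirely hypothetical numbers, not taken from the study) makes the effect concrete: if users rely on an LLM whenever they believe it will be correct, a more accurate model whose failures they cannot anticipate can still end up costlier than a weaker model whose failures they reliably defer on.

```python
def average_cost(cases, wrong_answer_penalty=10.0, defer_cost=1.0):
    """Cost of a workflow where the user relies on the model only when they
    believe it will be correct. Acting on a wrong answer is penalized heavily
    (the high-stakes case); deferring to a human has a small fixed cost."""
    total = 0.0
    for believed_correct, actually_correct in cases:
        if believed_correct:
            total += 0.0 if actually_correct else wrong_answer_penalty
        else:
            total += defer_cost
    return total / len(cases)

# Hypothetical scenario: the large model answers more questions correctly (3/4 vs 2/4),
# but the user cannot predict its one failure; the small model's failures are
# predictable, so the user defers on them instead of acting on bad output.
large_model = [(True, True), (True, True), (True, True), (True, False)]
small_model = [(True, True), (True, True), (False, False), (False, False)]
print("large model cost:", average_cost(large_model))  # 2.5
print("small model cost:", average_cost(small_model))  # 0.5
```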
The researchers attribute this misalignment to the novelty of LLMs. Humans have far more experience interacting with other humans, leading to a better understanding of their limitations and strengths. As we interact more with LLMs, our understanding of their capabilities is likely to improve.
Bridging the Gap: Aligning LLMs with Human Expectations
The MIT researchers are actively working to incorporate the human generalization function into the development of LLMs. They believe that incorporating human expectations into the training and evaluation of LLMs can lead to models that are more reliable and effective in real-world applications.
The researchers also emphasize the importance of their dataset as a benchmark for comparing LLM performance against the human generalization function. This could help developers identify and address potential misalignments that hinder LLM performance in real-world scenarios.
The Takeaway: The MIT research highlights the crucial need to account for human expectations when evaluating and deploying LLMs. By understanding the human generalization function, we can build better, more reliable AI systems that can effectively collaborate with humans and achieve their full potential.