GPT-4 Fails Steve Landsburg’s Economics Exam: What This Reveals About AI’s Understanding of Core Concepts
Professor Steve Landsburg of the University of Rochester recently gave an economics exam to GPT-4, the latest large language model from OpenAI, to test its comprehension of basic economic principles. The disappointing results offer a useful case study in the current capabilities and limitations of AI in mastering complex human knowledge domains. Let’s dive deeper into what this reveals.
Who is Steve Landsburg and What Was This Test?
First, some background. Steve Landsburg is a renowned economist and professor at the University of Rochester, known for his popular textbooks and writings explaining economic ideas. He previously gave a similar economics exam to GPT-3, an earlier OpenAI model, which scored 0 out of 90 points. Ouch!
To benchmark the progress of the new GPT-4 model, Landsburg devised a challenging exam covering standard microeconomics questions on topics like monopoly pricing, consumer surplus, and supply and demand. The test had 9 multi-part questions worth 10 points each, drawn from actual final exams Landsburg gives his students. This was no cakewalk.
The Results: GPT-4 Scores Just 4/90
So how did the promising new GPT-4 do? Unfortunately, it scored only 4 out of 90 points on the exam, failing miserably just like its predecessor. While this is a slight improvement over GPT-3’s complete zero, it shows that GPT-4 still lacks a solid understanding of core economic principles, despite its advances in generating human-like text.
Let’s dig into some examples that illustrate GPT-4’s lack of comprehension:
One question asked the AI to analyze monopoly pricing and consumer surplus with and without the ability to charge an entry fee. GPT-4 correctly calculated the profit without an entry fee, but failed to recognize that an entry fee would allow the monopolist to capture all consumer surplus, which changes the profit-maximizing per-unit price. This oversight suggests it doesn’t fully grasp profit incentives.
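To make the entry-fee logic concrete, here is a minimal sketch in Python. It does not use the numbers from Landsburg’s exam, which are not given here. It assumes a hypothetical setup: identical consumers, each with linear inverse demand P = a - b*Q, and a constant marginal cost c. The function names and the parameter values are illustrative assumptions only.

```python
# Hypothetical illustration: linear inverse demand P = a - b*Q per consumer,
# constant marginal cost c. These are assumptions, not the exam's figures.

def uniform_price_outcome(a: float, b: float, c: float) -> dict:
    """Monopolist charges a single per-unit price and no entry fee."""
    quantity = (a - c) / (2 * b)          # where marginal revenue equals marginal cost
    price = a - b * quantity              # read the price off the demand curve
    profit = (price - c) * quantity
    consumer_surplus = 0.5 * (a - price) * quantity  # triangle under demand, above price
    return {"price": price, "quantity": quantity, "entry_fee": 0.0,
            "profit": profit, "consumer_surplus": consumer_surplus}


def two_part_tariff_outcome(a: float, b: float, c: float) -> dict:
    """Monopolist charges an entry fee plus a per-unit price."""
    price = c                             # pricing at marginal cost maximizes total surplus
    quantity = (a - price) / b
    entry_fee = 0.5 * (a - price) * quantity  # fee equal to the consumer's entire surplus
    profit = entry_fee + (price - c) * quantity
    return {"price": price, "quantity": quantity, "entry_fee": entry_fee,
            "profit": profit, "consumer_surplus": 0.0}


if __name__ == "__main__":
    # Illustrative parameter values (assumed): a = 10, b = 1, c = 2.
    print(uniform_price_outcome(10.0, 1.0, 2.0))
    print(two_part_tariff_outcome(10.0, 1.0, 2.0))
```

Under these assumed values, the single-price monopolist charges 6 and earns a profit of 16, leaving the consumer a surplus of 8. With an entry fee, the per-unit price drops to the marginal cost of 2, the fee is 32, and profit doubles to 32. The point is that the fee changes what the best per-unit price is, which is the step GPT-4 reportedly missed.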
In another multi-part question involving sequential pricing between a monopoly-owned store and parking lot, GPT-4 was unable to identify that demand curves remain unchanged regardless of parking pricing. It also failed to determine the profit-maximizing equilibrium that results from the stores’ interdependent pricing decisions. This exposes gaps in its understanding of how prices affect consumer behavior and firm profit.
Many other AI models don’t fare much better on these conceptual tasks. For example, Anthropic’s Claude scores only 20-30% on complex reasoning problems, while Google’s LaMDA gets less than 10% accuracy on common sense tasks. Human economics students, on the other hand, score 70-80% on Landsburg’s exams. The gap is still wide!
The Root Issue: Lack of True Comprehension
What these examples reveal is that while GPT-4 can do some mathematical calculations, it does not actually comprehend economic reasoning. As Landsburg noted, the AI seems capable of "elementary material" but cannot grasp core theoretical concepts like profit incentives, consumer behavior, and market dynamics.
Without foundational knowledge of the hows and whys, GPT-4 fails at complex inference and analysis. This is concerning given that economics requires pulling together many abstract relationships into a holistic mental model. Symbolic reasoning remains a key frontier where human intelligence excels compared to even the most advanced AI systems today.
Perspectives on AI’s Ability to Master Human Knowledge
The results have sparked debate within the AI community and among educators about the limits of large language models like GPT-4 for replicating human skills and expertise.
Some worry these AIs could eventually lead to erosion of careers requiring advanced training. For example, over 80 million jobs in the US require analytical thinking and judgment – skills seemingly far beyond GPT-4 based on this exam. Others argue models still lack the reasoning, critical thinking, and integration of knowledge that comes naturally to humans.
As AI researcher Gary Marcus noted, "Facts alone aren’t sufficient for deeper levels of understanding." While models like GPT-4 can process facts and generate coherent text, they do not learn meaning in the same conceptual way humans do. We are still far from machines that can reason broadly like humans.
There are also concerns that flawed AI systems could spread misinformation if users assume their output is accurate despite its lack of comprehension. Landsburg’s test helps ground the hype around AI and clarifies where human intelligence maintains a distinctive edge, at least for now.
The Bottom Line
Clearly, more rigorous testing across different knowledge domains is required to fully map out AI’s strengths and weaknesses. But for now, we can conclude that despite impressive advances, even the latest natural language models have a long way to go before achieving true mastery of complex subject matter the way strong human students can.
While concerning in some ways, this lag also presents opportunities. Teachers can focus on developing critical thinking and conceptual reasoning in students, skills not readily replicable by AI. Subject matter specialists can leverage AI tools for certain rote tasks while leading on strategy and higher-order thinking. Together, humans and AI systems can cover each other’s weaknesses for the best outcomes.
Though AI keeps getting smarter, the results of Landsburg’s exam indicate there are still frontiers like conceptual reasoning where uniquely human strengths shine. We should continue pushing AI advances through tests like this economics exam, while staying grounded about capabilities. With ethical development and human guidance, AI and human intelligence can work together rather than compete.