Imagine a final exam so tough that even the smartest humans would struggle—and even today’s best artificial intelligence (AI) systems barely get a few answers right. That’s exactly what “Humanity’s Last Exam” is all about. In simple terms, it’s a new test created by AI experts to see if AI has truly mastered human-level knowledge and reasoning.
What Is It?
Humanity’s Last Exam (often abbreviated as HLE) is a benchmark—a set of questions designed to measure how well an AI can answer academic problems. Unlike older tests where AI models easily score above 90%, this exam comprises 3,000 challenging questions spanning more than a hundred subjects, from math and science to history and literature. Some questions even mix text with images, making the exam “multi-modal.” Essentially, it’s meant to be the hardest academic test out there for AI.
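To make “multi-modal” concrete, here is a minimal sketch of what a single HLE-style question record might look like. The field names and values are purely illustrative assumptions for this article, not the benchmark’s actual data format.

```python
# A hypothetical HLE-style question record. All field names and values
# are illustrative assumptions, not the benchmark's actual schema.
sample_question = {
    "id": "hle-demo-0001",
    "subject": "classics",
    "prompt": "Translate the inscription shown in the image.",
    "image_path": "inscription.png",      # present only on multi-modal items
    "reference_answer": "To the divine spirits",  # invented example answer
}
```

A text-only math question would look the same, just without the image field.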
“We wanted problems that would test the capabilities of the models at the frontier of human knowledge and reasoning.”
– Dan Hendrycks, co-founder of the Center for AI Safety
Why Create a New Test?
As AI models have improved over the years, they’ve started to “ace” many of the older tests. Benchmarks like MMLU (Massive Multitask Language Understanding) are now too easy for modern AI: models answer most of the questions correctly because they’ve seen similar problems during training. This makes it hard for researchers and policymakers to understand how capable these systems really are. Humanity’s Last Exam was developed to fill that gap—it’s so tough that current state-of-the-art models (think of models like GPT-4o) only score around 3–10% on it.
In layman’s terms, imagine if every student in a class started scoring 100% on a test because they had memorized the answers. That test would no longer be useful to see who truly understands the material. The same happens with AI. HLE forces AI to work on questions it hasn’t “seen” before, giving us a clearer idea of its true problem-solving skills.
Who Made It and How?
This exam wasn’t created by one person or company. It’s a global collaborative effort organized by the nonprofit Center for AI Safety and the tech company Scale AI. Nearly 1,000 experts from more than 500 institutions around the world contributed tough, real academic questions that no current AI could easily answer. Some questions even require knowledge that goes beyond mere memorization—they demand careful reasoning and creative thinking.
“Humanity’s Last Exam is designed to be the final closed-ended academic benchmark of its kind with broad subject coverage.”
– Scale AI
How Does It Work?
Each question in the exam has one correct answer that is unambiguous and can be automatically checked. For example, a question might show a picture of an ancient inscription and ask for its translation, or it might pose a complex math problem that needs a precise numerical answer. The goal is to ensure that AI models are not just regurgitating information but are truly reasoning through a problem.
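As a rough illustration of what “automatically checked” could mean in practice, here is a minimal Python sketch of an answer checker. It assumes exact matching for text answers and a small tolerance for numeric ones; the benchmark’s real grading pipeline may work differently.

```python
# A minimal sketch of automatic answer checking. Assumes exact matching
# for text and a small tolerance for numbers; HLE's actual grading
# pipeline may differ.
def check_answer(model_output: str, reference: str, tol: float = 1e-6) -> bool:
    """Return True if the model's answer matches the reference answer."""
    try:
        # Numeric answers: compare as floats within a tolerance.
        return abs(float(model_output) - float(reference)) <= tol
    except ValueError:
        # Text answers: normalize case and surrounding whitespace.
        return model_output.strip().lower() == reference.strip().lower()

print(check_answer("3.14159", "3.14159"))                             # True
print(check_answer("to the divine spirits", "To the divine spirits")) # True
print(check_answer("42", "43"))                                       # False
```

Because every answer can be checked mechanically like this, a model’s score is simply the fraction of questions it gets right, with no partial credit.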
When AI models are tested on HLE, they perform far worse than they do on other benchmarks. For instance, while the same models score above 90% on many popular tests, GPT-4o and its peers land in the single digits on HLE. This stark difference shows that there is still a long way to go before AI can match the deep, expert-level understanding humans possess.
What Does It Tell Us About AI?
By pushing AI to solve questions that even experts find challenging, Humanity’s Last Exam serves as a “reality check” for the field. It highlights the gap between current AI capabilities and the high-level thinking that advanced academic work requires. Moreover, it provides researchers and policymakers with a common yardstick for measuring progress, sparking discussions on both the potential and the risks of future AI developments.
In everyday language, while AI might be great at everyday tasks like chatting or basic problem solving, HLE reminds us that AI still struggles with tasks that require true understanding and reasoning. It’s like having a car that can drive well on a smooth highway (old benchmarks) but falters on a rugged, unpredictable mountain road (Humanity’s Last Exam).
Why Should You Care?
Even if you’re not a tech expert, the evolution of AI affects all of us. Advances in AI promise benefits like better healthcare and smarter technology—but they also bring challenges, such as ethical concerns and potential risks. Benchmarks like Humanity’s Last Exam help ensure that as AI becomes more integrated into our lives, its capabilities are thoroughly tested and understood. This, in turn, helps us build safer, more reliable systems that can eventually assist in solving real-world problems.
In summary, Humanity’s Last Exam is a super-tough test for AI. It’s designed to show us just how far AI has to go before it can truly think like a human when it comes to complex, academic tasks. By understanding its limitations today, we can better prepare for a future where AI plays an even bigger role in our lives.
🚀 About Us
At AI Horizon, we believe that AI and technology will shape future generations. That’s why we’re dedicated to delivering cutting-edge insights and innovative tools—all completely free for our readers. Explore our AI and productivity tools at alt4.in, subscribe to our newsletters (AI Horizon and Tech Horizon) for the latest updates, support our work on Ko-fi, and join our vibrant community on Discord. Together, we’re building a smarter, more connected future. Subscribe, contribute, and join us today! 🔥
A question I often think about is whether we would recognize a truly alien kind of intelligence, and whether we could analyze or measure it by our (continually changing) standards. For example, we now know that trees can recognize when other trees are in trouble and send them nutrients and other helpful substances via the shared root system. Is that a kind of intelligence? Do we have any real understanding of whether whales or octopuses have intelligence?
Many years ago, my late mother-in-law had a student who came back from taking an ETS-like exam. He was perplexed by a question about who Christopher Robin’s friend was. This was before all the Disney cartoons, and this young man had a home and school background that did not include A.A. Milne, though the exam seemed to assume all its takers would have such knowledge. So cultural background can skew intelligence tests. Would a brilliant Martian even comprehend some of the questions we ask AI?
We measure things by what we know and often fall short when encountering completely new phenomena.
In the rush to get quick results, the word “intelligence” gets mixed up with “knowledge” and “skills.” Knowledge and skills make an intelligent individual very powerful, but they are not “intelligence.”