<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>machine learning &#8211; That Freaky NewGuy</title>
	<atom:link href="https://freakynewguy.net/tag/machine-learning/feed/" rel="self" type="application/rss+xml" />
	<link>https://freakynewguy.net</link>
	<description>Just Another Noob</description>
	<lastBuildDate>Sat, 14 Mar 2026 22:16:10 +0000</lastBuildDate>
	<language>en-AU</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>

<image>
	<url>https://freakynewguy.net/wp-content/uploads/2022/08/cropped-Noobicon-1-32x32.png</url>
	<title>machine learning &#8211; That Freaky NewGuy</title>
	<link>https://freakynewguy.net</link>
	<width>32</width>
	<height>32</height>
</image> 
<site xmlns="com-wordpress:feed-additions:1">209481562</site>	<item>
		<title>Exploring Humanity&#8217;s Last Exam for AI Intelligence Assessment</title>
		<link>https://freakynewguy.net/humanitys-last-exam-ai-test/</link>
					<comments>https://freakynewguy.net/humanitys-last-exam-ai-test/#respond</comments>
		
		<dc:creator><![CDATA[Freaky Newguy]]></dc:creator>
		<pubDate>Sat, 14 Mar 2026 22:01:36 +0000</pubDate>
				<category><![CDATA[AI]]></category>
		<category><![CDATA[Anything Else]]></category>
		<category><![CDATA[News From The Interwebs]]></category>
		<category><![CDATA[AI assessment]]></category>
		<category><![CDATA[AI benchmark]]></category>
		<category><![CDATA[artificial intelligence]]></category>
		<category><![CDATA[cognitive testing]]></category>
		<category><![CDATA[HLE]]></category>
		<category><![CDATA[Humanity's Last Exam]]></category>
		<category><![CDATA[machine learning]]></category>
		<guid isPermaLink="false">https://freakynewguy.net/?p=1360</guid>

					<description><![CDATA[<p>Humanity's Last Exam (HLE) is a new benchmark designed to assess AI's advanced reasoning with 2,500 expert-level questions. Unlike previous tests, HLE prioritises critical thinking over simple fact recall. While it highlights AI capabilities, critics argue it lacks real-world applicability and may not capture AI creativity or complex problem-solving.</p>
<p>The post <a rel="nofollow" href="https://freakynewguy.net/humanitys-last-exam-ai-test/">Exploring Humanity&#8217;s Last Exam for AI Intelligence Assessment</a> appeared first on <a rel="nofollow" href="https://freakynewguy.net">That Freaky NewGuy</a>.</p>
]]></description>
										<content:encoded><![CDATA[<h2 class="wp-block-heading" style="border-bottom: 2px solid #00c2ff; padding-bottom: 10px;">Humanity&#8217;s Last Exam: The AI Test That Could Stump Einstein?</h2>
<p class="wp-block-paragraph">Estimated reading time: 6 minutes</p>
<ul class="wp-block-paragraph">
<li><strong>Ultimate benchmark:</strong> HLE features 2,500 expert-level questions.</li>
<li><strong>Focus on reasoning:</strong> It assesses AI’s critical thinking and problem-solving skills.</li>
<li><strong>High-stakes testing:</strong> Designed to challenge the best AI models available.</li>
<li><strong>Implications for the future:</strong> Offers insights into AI capabilities and limitations.</li>
<li><strong>Not a free lunch:</strong> Critiques highlight its limitations in real-world application.</li>
</ul>
<h3 id="h-the-conception-of-hle" class="wp-block-heading" style="border-bottom: 2px solid #00c2ff; padding-bottom: 10px;">The Conception of HLE: A Brainchild of Necessity</h3>
<p class="wp-block-paragraph">HLE didn’t just materialize from thin air. It was conceived by the <a style="color: #00c2ff !important;" href="https://safe.ai" target="_blank" rel="noopener">Center for AI Safety</a> and <a style="color: #00c2ff !important;" href="https://scale.com" target="_blank" rel="noopener">Scale AI</a>, among others, in response to a notable issue: existing tests had stopped telling us anything interesting. With the likes of <a style="color: #00c2ff !important;" href="https://mmlu.org" target="_blank" rel="noopener">MMLU</a> (Massive Multitask Language Understanding) effectively saturated, AI models were cruising through the older benchmarks. HLE was established as a high-stakes benchmark focusing on advanced reasoning rather than the boring old “recall this stuff” game.</p>
<p class="wp-block-paragraph">The Nature paper titled “A benchmark of expert-level academic questions to assess AI capabilities” lays the groundwork for HLE, with its focus on multi-step reasoning in disciplines like mathematics, natural sciences, humanities, computer science, literature, and history. Basically, it takes the &#8220;intelligence&#8221; in &#8220;artificial intelligence&#8221; and gives it a workout.</p>
<h3 id="h-the-structure-of-hle" class="wp-block-heading" style="border-bottom: 2px solid #00c2ff; padding-bottom: 10px;">The Structure of HLE: Questioning Everything (Almost)</h3>
<h4 id="h-key-features" class="wp-block-heading" style="border-bottom: 2px solid #00c2ff; padding-bottom: 10px;">Key Features</h4>
<p class="wp-block-paragraph">HLE is composed of a whopping <strong>2,500 public questions</strong>, with an additional <strong>~500 holdout questions</strong> that remain guarded like celebrity secrets. Here’s the breakdown:</p>
<ul class="wp-block-paragraph">
<li><strong>Question Types:</strong>
<ul>
<li>Approximately 76% of the questions demand short, exact answers (which means AI can’t just parrot back memorised facts).</li>
<li>About 24% are multiple-choice (because nothing says “you’re trapped” quite like a question with options).</li>
<li>Roughly 14% are multimodal (a share that overlaps with the two types above, which is why the figures exceed 100%), requiring the brainpower to analyze both text and images.</li>
</ul>
</li>
</ul>
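<p class="wp-block-paragraph">To make that breakdown concrete, here’s a quick back-of-the-envelope sketch in Python (the percentages are the approximate figures above; they sum to more than 100% because the multimodal share overlaps with the other two types):</p>

```python
# Rough composition of HLE's 2,500 public questions,
# using the approximate percentages quoted above.
TOTAL_PUBLIC = 2_500

shares = {
    "short_answer": 0.76,     # exact-answer questions
    "multiple_choice": 0.24,
    "multimodal": 0.14,       # overlaps with the two types above
}

# Convert each share into an approximate question count.
counts = {kind: round(TOTAL_PUBLIC * share) for kind, share in shares.items()}

for kind, n in counts.items():
    print(f"{kind}: ~{n} questions")
```

<p class="wp-block-paragraph">That works out to roughly 1,900 short-answer, 600 multiple-choice, and 350 multimodal questions.</p>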
<h4 id="h-difficulty-criteria" class="wp-block-heading" style="border-bottom: 2px solid #00c2ff; padding-bottom: 10px;">Difficulty Criteria</h4>
<p class="wp-block-paragraph">The questions aren’t your run-of-the-mill trivia. They are original, possess a single verifiable answer, and are designed to stump those cutting-edge large language models (LLMs). A meticulous filtering process culled ~70,000 questions to a mere 6,000, ultimately resulting in the final public and private sets.</p>
<ol class="wp-block-paragraph">
<li>Filtered from 70,000 to around 13,000 through expert peer review.</li>
<li>Shrunk to ~6,000 after manual approval.</li>
<li>Final split: 2,500 public and ~500 private questions.</li>
</ol>
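<p class="wp-block-paragraph">The funnel above can be sketched in a few lines of Python (the stage names and counts come straight from the list; the retention rates are just derived arithmetic):</p>

```python
# The HLE question-filtering funnel, using the approximate counts above.
stages = [
    ("submitted", 70_000),
    ("after expert peer review", 13_000),
    ("after manual approval", 6_000),
    ("final public + private set", 3_000),  # 2,500 public + ~500 holdout
]

# Print what fraction of questions survived each stage.
for (name, n), (_, prev) in zip(stages[1:], stages[:-1]):
    print(f"{name}: {n:,} remaining ({n / prev:.1%} of the previous stage)")
```

<p class="wp-block-paragraph">Fewer than one question in twenty survived from submission to the final set, which goes some way to explaining the difficulty.</p>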
<p><strong>The results were striking:</strong> even cutting-edge AI models stumbled on this exam. GPT-4o managed just 2.7% accuracy, while Claude 3.5 Sonnet scored 4.1%, and OpenAI’s o1 model topped out at roughly 8%. However, newer systems showed dramatic improvement—Gemini 3.1 Pro and Claude Opus 4.6 leaped to 40-50% accuracy, signaling rapid progress in the field.</p>
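<p class="wp-block-paragraph">Those low single-digit scores follow partly from how strict the grading is: every question has one verifiable answer, so scoring reduces to accuracy over exact answers. Here’s a minimal, illustrative grader (a sketch only; the real HLE evaluation reportedly uses a model-based judge to tolerate formatting differences in short answers, so don’t mistake this for the official harness):</p>

```python
def exact_match_accuracy(predictions, answers):
    """Share of predictions that match the reference answer exactly,
    after trimming whitespace and lower-casing."""
    def norm(s):
        return s.strip().lower()
    hits = sum(norm(p) == norm(a) for p, a in zip(predictions, answers))
    return hits / len(answers)

# Toy example: one correct answer out of three questions.
preds = ["4", "Paris", "entropy increases"]
refs  = ["4", "Vienna", "energy is conserved"]
print(exact_match_accuracy(preds, refs))  # prints 0.3333333333333333
```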
<h3 id="h-why-hle-is-crucial" class="wp-block-heading" style="border-bottom: 2px solid #00c2ff; padding-bottom: 10px;">Why HLE is Crucial: The Benchmark of Intelligence</h3>
<p class="wp-block-paragraph">While many benchmarks put AI’s capabilities on display, HLE takes matters a step further. It doesn’t just throw questions at AI models but assesses their capability to understand and work through complex reasoning tasks. Performance data reveals that even state-of-the-art LLMs fail to shine, showcasing low accuracy and a whopping gap between AI&#8217;s capabilities and human expertise.</p>
<p class="wp-block-paragraph">This is where it gets spicy. HLE isn’t just another box-ticking exercise; it offers a glimpse into the future of AI development. Here’s a handy comparison of benchmark tests to illustrate the unique nature of HLE:</p>
<table class="wp-block-table">
<thead>
<tr>
<th><strong>Benchmark Comparison</strong></th>
<th><strong>Focus</strong></th>
<th><strong>HLE Differentiation</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>MMLU</strong></td>
<td>57 subjects, zero-shot knowledge</td>
<td>Saturated; HLE emphasizes reasoning over recall.</td>
</tr>
<tr>
<td><strong>MMLU-Pro+</strong></td>
<td>Higher-order reasoning</td>
<td>HLE uses expert-curated, more challenging problems.</td>
</tr>
<tr>
<td><strong>GPQA</strong></td>
<td>Graduate-level STEM</td>
<td>HLE offers a broader range of subjects.</td>
</tr>
</tbody>
</table>
<h4 id="h-implications-for-the-future" class="wp-block-heading" style="border-bottom: 2px solid #00c2ff; padding-bottom: 10px;">Implications for the Future</h4>
<p class="wp-block-paragraph">HLE acts as a robust metric for tracking how far AI models actually progress. It’s a tool for scientists, policymakers, and educators to assess AI capabilities without implying that these systems possess full artificial general intelligence (AGI). Let’s face it: A high score on HLE doesn’t mean AIs are on the brink of leading revolutions. They might be great at formal exams but completely clueless about real-world nuances or the art of synthesizing disparate information.</p>
<h3 id="h-gotchas" class="wp-block-heading" style="border-bottom: 2px solid #00c2ff; padding-bottom: 10px;">Gotchas: The Limitations and Trade-offs</h3>
<p class="wp-block-paragraph">There’s no such thing as a free lunch, and HLE comes with its own trade-offs. Testing structured problems is one thing; navigating the unpredictable waters of real-world scenarios is another. Critics argue that while HLE may serve as an impressive benchmark, it doesn’t capture the ability to handle the messy, chaotic information that humans navigate instinctively every day.</p>
<p class="wp-block-paragraph">A major limitation is that HLE focuses on closed-ended questions, which leave little room for measuring AI creativity or the synthesis of information in novel ways. Moreover, high scores could signal &#8220;inhuman&#8221; reasoning, and isn&#8217;t that exactly what we need to worry about? Who wants an overconfident AI spouting answers as if it were a know-it-all? It raises the question: how much intelligence is too much?</p>
<h3 id="h-whats-next" class="wp-block-heading" style="border-bottom: 2px solid #00c2ff; padding-bottom: 10px;">What’s Next: The Future of AI Testing</h3>
<p class="wp-block-paragraph">The future holds intriguing possibilities. As discussions around the implications of HLE unfold, it&#8217;s becoming clear that this assessment tool will be critical in evaluating AI&#8217;s role in education, safety, and beyond. Hosting the benchmark at <a style="color: #00c2ff !important;" href="https://agi.safe.ai" target="_blank" rel="noopener">agi.safe.ai</a> also opens avenues for educators and curious minds to engage with these public questions and potentially craft fresh, innovative learning experiences.</p>
<p class="wp-block-paragraph">More research and iterations are needed, especially in exploring how well AI can integrate disparate pieces of information and engage in creative solutions. As AI models grow ever more sophisticated, the means of testing their capabilities must evolve with sophistication that matches their potential.</p>
<p class="wp-block-paragraph">In essence, HLE is not the end but rather the beginning of a comprehensive understanding of AI capabilities — a vital stepping stone toward figuring out just how smart these artificial minds can get. If the last exam is a sign of what&#8217;s to come, the future of AI tests is bound to be as unpredictable as a cat on a hot tin roof.</p>
<h3 id="h-faq" class="wp-block-heading" style="border-bottom: 2px solid #00c2ff; padding-bottom: 10px;">FAQ</h3>
<ul class="wp-block-paragraph">
<li><strong>What is HLE?</strong> HLE stands for Humanity&#8217;s Last Exam, a benchmark aimed at assessing AI&#8217;s advanced reasoning capabilities.</li>
<li><strong>How many questions are in HLE?</strong> HLE consists of 2,500 public questions and ~500 holdout questions.</li>
<li><strong>What subjects does HLE cover?</strong> It spans a range of disciplines including mathematics, natural sciences, humanities, and more.</li>
<li><strong>Why is HLE important?</strong> It challenges AI models to demonstrate understanding and problem-solving skills rather than mere recall.</li>
<li><strong>What are the limitations of HLE?</strong> Critics argue it does not effectively evaluate AI&#8217;s ability to navigate real-world scenarios and may focus too heavily on closed-ended questions.</li>
</ul>
<p>The post <a rel="nofollow" href="https://freakynewguy.net/humanitys-last-exam-ai-test/">Exploring Humanity&#8217;s Last Exam for AI Intelligence Assessment</a> appeared first on <a rel="nofollow" href="https://freakynewguy.net">That Freaky NewGuy</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://freakynewguy.net/humanitys-last-exam-ai-test/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">1360</post-id>	</item>
	</channel>
</rss>
