# huggingface/hle (Humanity's Last Exam)

Evaluate LLMs against [Humanity's Last Exam (HLE)](https://arxiv.org/abs/2501.14249), a challenging benchmark created by 1,000+ experts across 500+ institutions. HLE features 3,000+ questions spanning 100+ subjects, designed to push AI capabilities to their limits.

**📖 [Read the complete HLE benchmark guide →](https://www.promptfoo.dev/docs/guides/hle-benchmark/)**

You can run this example with:

```bash
npx promptfoo@latest init --example huggingface/hle
cd huggingface/hle
```

## Prerequisites

- OpenAI API key set as `OPENAI_API_KEY`
- Anthropic API key set as `ANTHROPIC_API_KEY`
- Hugging Face access token (required for dataset access)

## Setup

Set your Hugging Face token:

```bash
export HF_TOKEN=your_token_here
```

Or add it to your `.env` file:

```env
HF_TOKEN=your_token_here
```

Get your token at [huggingface.co/settings/tokens](https://huggingface.co/settings/tokens).

## Run the Evaluation

Run the evaluation:

```bash
npx promptfoo@latest eval
```

View results:

```bash
npx promptfoo@latest view
```

## What's Tested

This evaluation tests models on:

- Advanced mathematics and sciences
- Humanities and social sciences
- Professional domain knowledge
- Multimodal reasoning
- Interdisciplinary topics

Each question is evaluated for accuracy using an LLM judge that compares the model's response against the verified correct answer.

## Current AI Performance

HLE is designed to be extremely challenging. Recent model performance:

- **OpenAI Deep Research**: 26.6% accuracy
- **o4-mini**: 18.1% accuracy
- **DeepSeek-R1**: 9.4% accuracy

Low scores are expected - this benchmark represents the cutting edge of AI evaluation.

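
The accuracy figures depend on how the LLM judge grades each response. As a rough illustration of the judge described under What's Tested, the fragment below is a minimal sketch of an `llm-rubric` assertion. It is not necessarily the configuration shipped with this example: the rubric wording is hypothetical, and `{{answer}}` assumes the dataset exposes its verified answer under a column of that name.

```yaml
# Sketch only: grade every test case with an LLM judge.
defaultTest:
  assert:
    - type: llm-rubric
      # {{answer}} is assumed to be the dataset column holding the verified answer.
      value: |
        The response must reach the same final answer as the verified
        correct answer: {{answer}}
        Ignore differences in wording or formatting.
```

Placing the assertion under `defaultTest` applies it to every question loaded from the dataset, so individual test cases need no per-question grading rules.
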
## Customization

### Test More Questions

Increase the sample size:

```yaml
tests:
  - huggingface://datasets/cais/hle?split=test&limit=100
```

### Add More Models

Compare multiple providers:

```yaml
providers:
  - anthropic:claude-sonnet-4-6
  - openai:o4-mini
  - deepseek:deepseek-reasoner
```

### Different Prompting

Try alternative prompting strategies by modifying `prompt.py` or using static prompts:

```yaml
prompts:
  - 'Answer this question step by step: {{question}}'
  - file://prompt.py:create_hle_prompt
```

## Resources

- [HLE Paper](https://arxiv.org/abs/2501.14249)
- [HLE Dataset](https://huggingface.co/datasets/cais/hle)
- [Promptfoo Documentation](https://promptfoo.dev/docs/getting-started)