HALLMARK is a citation hallucination benchmark for evaluating reference-verification tools on ML papers. It provides 2,525 annotated entries spanning 14 hallucination types across 3 difficulty tiers, with 6 sub-tests per entry for fine-grained diagnosis.
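Concretely, one annotated entry might be represented as below. This is a minimal sketch: the field names and the idea of a `doi` attribute are illustrative assumptions, not HALLMARK's actual schema.

```python
from dataclasses import dataclass, field

# Illustrative schema only; field and sub-test names are assumptions,
# not HALLMARK's actual on-disk format.
@dataclass
class HallmarkEntry:
    entry_id: str
    bibtex: str                   # the citation as it appears in the paper
    doi: str | None               # DOI string, if the entry carries one
    tier: str                     # "easy" | "medium" | "hard"
    hallucination_type: str       # one of the 14 annotated types
    is_hallucinated: bool         # ground-truth label
    subtests: dict[str, bool] = field(default_factory=dict)  # 6 pass/fail labels
```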
## Difficulty Tiers
Tiers reflect how difficult a hallucination is to detect: Easy entries have obvious errors (e.g., nonexistent DOIs), Medium entries require cross-referencing metadata, and Hard entries involve subtle misattributions that demand deep semantic understanding of the cited work.
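To make the tiers concrete, here are invented examples matching each definition; none of these are actual HALLMARK entries.

```python
# Invented examples only, one per tier; not actual HALLMARK entries.
tier_examples = {
    "easy": {
        # The DOI prefix is unregistered, so resolution fails immediately.
        "citation": "Smith et al., 2021. Deep Nets. doi:10.9999/fake.123",
        "error": "nonexistent DOI",
    },
    "medium": {
        # Title and authors are real, but venue and year do not match the
        # actual record, so metadata cross-referencing is required.
        "citation": "Vaswani et al., 2019. Attention Is All You Need. ICML.",
        "error": "wrong venue and year (actually NeurIPS 2017)",
    },
    "hard": {
        # Bibliographic metadata is perfectly valid; the paper simply does
        # not support the claim it is cited for.
        "citation": "He et al., 2016. Deep Residual Learning for Image "
                    "Recognition. CVPR.",
        "error": "misattributed claim; requires reading the cited work",
    },
}
```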
## Built-in Baselines
- DOI-only — resolves DOIs and flags missing or broken links (a minimal sketch follows this list)
- bibtex-updater — metadata-based verification via BibTeX normalization
- LLM-based — direct LLM prompting for citation plausibility
- Ensemble — combines DOI + metadata + LLM signals
- HaRC — hallucinated reference classification baseline
- verify-citations — end-to-end citation verification pipeline
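As a sketch of how the simplest of these could work, below is a minimal DOI-only checker. It assumes the illustrative `HallmarkEntry` schema from earlier; the confidence values are arbitrary, and doi.org's response to HEAD requests can vary by registrar, so treat this as an approximation rather than the benchmark's actual baseline.

```python
import urllib.error
import urllib.request

def doi_resolves(doi: str, timeout: float = 10.0) -> bool:
    """True if the DOI resolves at doi.org; unregistered or broken DOIs
    return an HTTP error (typically 404) from the resolver."""
    req = urllib.request.Request(
        f"https://doi.org/{doi}",
        method="HEAD",
        headers={"User-Agent": "hallmark-doi-check/0.1"},
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status < 400
    except (urllib.error.HTTPError, urllib.error.URLError):
        return False  # dead DOI, or a network failure treated conservatively

def doi_only_baseline(entry: "HallmarkEntry") -> tuple[bool, float]:
    """Flag an entry as hallucinated if it lacks a DOI or has a dead one.
    Returns (is_hallucinated, confidence); confidences are illustrative."""
    if not entry.doi:
        return True, 0.5            # a missing DOI is only weak evidence
    return not doi_resolves(entry.doi), 0.9
```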
## Metrics
- Detection Rate — fraction of hallucinated citations correctly flagged
- F1 — harmonic mean of precision and recall
- Tier-weighted F1 — F1 weighted by difficulty tier (Hard counts more)
- detect@k — detection rate within the top-k ranked citations
- FPR — false positive rate: fraction of legitimate citations incorrectly flagged (reported on the leaderboard below)
- ECE — expected calibration error of confidence scores
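A minimal sketch of these computations, assuming binary hallucination labels, per-entry confidence scores, and illustrative tier weights; HALLMARK's actual tier weighting and its exact detect@k definition are not specified here, so both are assumptions.

```python
import numpy as np

# Illustrative tier weights ("Hard counts more"); the benchmark's actual
# weighting scheme is an assumption here.
TIER_WEIGHTS = {"easy": 1.0, "medium": 2.0, "hard": 3.0}

def detection_rate(y_true, y_pred):
    """Fraction of truly hallucinated citations (y_true == 1) flagged."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return y_pred[y_true == 1].mean()

def f1(y_true, y_pred):
    """Harmonic mean of precision and recall on the hallucinated class."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = int(((y_true == 1) & (y_pred == 1)).sum())
    fp = int(((y_true == 0) & (y_pred == 1)).sum())
    fn = int(((y_true == 1) & (y_pred == 0)).sum())
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def tier_weighted_f1(y_true, y_pred, tiers):
    """Weighted mean of per-tier F1 scores."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    scores, weights = [], []
    for tier, w in TIER_WEIGHTS.items():
        mask = np.asarray([t == tier for t in tiers])
        if mask.any():
            scores.append(f1(y_true[mask], y_pred[mask]))
            weights.append(w)
    return float(np.average(scores, weights=weights))

def detect_at_k(y_true, scores, k):
    """One plausible reading of detect@k: the fraction of all hallucinated
    citations recovered among the k highest-scoring predictions."""
    y_true, scores = np.asarray(y_true), np.asarray(scores)
    top_k = np.argsort(scores)[::-1][:k]
    return y_true[top_k].sum() / y_true.sum()

def ece(y_true, conf, n_bins=10):
    """Expected calibration error: per-bin gap between mean predicted
    probability of hallucination and the empirical positive rate."""
    y_true, conf = np.asarray(y_true, dtype=float), np.asarray(conf)
    bins = np.minimum((conf * n_bins).astype(int), n_bins - 1)
    return sum(
        (bins == b).mean() * abs(y_true[bins == b].mean() - conf[bins == b].mean())
        for b in range(n_bins) if (bins == b).any()
    )
```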
Final system rankings are produced with a Plackett–Luce model, which aggregates the per-metric, per-tier orderings into a single overall ordering.
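One standard way to realize this aggregation is Hunter's (2004) MM algorithm for Plackett–Luce maximum likelihood, sketched below with each metric's ordering of systems treated as one observed ranking. The fitting procedure itself is standard; its use here as HALLMARK's exact aggregation recipe is an assumption.

```python
import numpy as np

def fit_plackett_luce(rankings, n_items, n_iter=500, tol=1e-9):
    """Fit Plackett-Luce worth parameters with Hunter's MM updates.
    `rankings` is a list of full orderings over item ids, best first."""
    gamma = np.ones(n_items)
    for _ in range(n_iter):
        wins = np.zeros(n_items)    # stages won by each item
        denom = np.zeros(n_items)   # accumulated MM denominators
        for r in rankings:
            r = list(r)
            for t in range(len(r) - 1):      # the last item wins no stage
                rest = r[t:]                 # items still in contention
                wins[r[t]] += 1
                denom[rest] += 1.0 / gamma[rest].sum()
        new = wins / np.maximum(denom, 1e-300)
        new /= new.sum()                     # fix the scale (identifiability)
        if np.max(np.abs(new - gamma)) < tol:
            return new
        gamma = new
    return gamma

# Example: aggregate three per-metric orderings of four systems (ids 0-3).
rankings = [[0, 1, 2, 3], [1, 0, 2, 3], [0, 2, 1, 3]]
worth = fit_plackett_luce(rankings, n_items=4)
final_order = np.argsort(worth)[::-1]        # best system first
```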
## Design Principles
HALLMARK draws on best practices from established code and ML benchmarks: HumanEval (functional correctness via test cases), SWE-bench (real-world task grounding), LiveCodeBench (contamination-resistant temporal splits), and ONEBench (multi-metric aggregation into a unified score).
## Leaderboard
Ranked by F1-Hallucination (primary metric) on the test_public split (831 entries).
Arrows indicate the preferred direction; column abbreviations: DR = Detection Rate, F1-H = F1-Hallucination, TW-F1 = Tier-weighted F1, FPR = False Positive Rate, ECE = Expected Calibration Error.
| # | Baseline | Type | DR ↑ | F1-H ↑ | TW-F1 ↑ | FPR ↓ | ECE ↓ |
|---|---|---|---|---|---|---|---|
| 1 | Claude Sonnet 4.6 + BibTeX-Updater (agentic) | LLM | 0.965 | 0.901 | 0.937 | 0.334 | 0.071 |
| 2 | GPT-5.1 + BibTeX-Updater (agentic) | LLM | 0.938 | 0.885 | 0.921 | 0.345 | 0.054 |
| 3 | bibtex-updater | TOOL | 0.886 | 0.858 | 0.892 | 0.338 | 0.265 |
| 4 | GPT-5.1 (agentic) | LLM | 0.943 | 0.850 | 0.903 | 0.523 | 0.137 |
| 5 | GPT-5.1 (tool-augmented) | LLM | 0.829 | 0.845 | 0.862 | 0.254 | 0.104 |
| 6 | Claude Opus 4.7 | LLM | 0.735 | 0.832 | 0.851 | 0.059 | 0.126 |
| 7 | Qwen 3 Max | LLM | 0.901 | 0.803 | 0.862 | 0.652 | 0.247 |
| 8 | DeepSeek R1 | LLM | 0.781 | 0.802 | 0.820 | 0.330 | 0.202 |
| 9 | Qwen 3 235B | LLM | 0.888 | 0.799 | 0.859 | 0.631 | 0.249 |
| 10 | Mistral Large | LLM | 0.658 | 0.724 | 0.732 | 0.303 | 0.289 |
| 11 | Llama 4 Maverick | LLM | 0.604 | 0.711 | 0.700 | 0.178 | 0.207 |
| 12 | Gemini 2.5 Flash | LLM | 0.483 | 0.627 | 0.606 | 0.111 | 0.335 |
| 13 | Gemini 2.5 Pro | LLM | 0.437 | 0.594 | 0.582 | 0.064 | 0.382 |
| 14 | DOI-only | TOOL | 0.369 | 0.483 | 0.436 | 0.303 | 0.148 |