HALLMARK is a citation hallucination benchmark for evaluating reference-verification tools on ML papers. It provides 2,525 annotated entries spanning 14 hallucination types across 3 difficulty tiers, with 6 sub-tests per entry for fine-grained diagnosis.
## Difficulty Tiers
Tiers reflect how difficult a hallucination is to detect: Easy entries have obvious errors (e.g., nonexistent DOIs), Medium entries require cross-referencing metadata, and Hard entries involve subtle misattributions that demand deep semantic understanding of the cited work.
## Built-in Baselines
- DOI-only — resolves DOIs and flags missing or broken links
- bibtex-updater — metadata-based verification via BibTeX normalization
- LLM-based — direct LLM prompting for citation plausibility
- Ensemble — combines DOI + metadata + LLM signals
- HaRC — hallucinated reference classification baseline
- verify-citations — end-to-end citation verification pipeline
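The DOI-only baseline above can be approximated with a two-pass check: a cheap syntactic filter, then an attempt to resolve the DOI. The sketch below is a minimal illustration, not HALLMARK's actual baseline code; the function names, the `User-Agent` string, and the simplified DOI regex are assumptions.

```python
import re
import urllib.request
from typing import Optional

# Simplified pattern for modern DOIs (10.<registrant>/<suffix>);
# real-world suffixes are messier, so this is deliberately permissive.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")

def has_valid_doi_syntax(doi: str) -> bool:
    """Cheap first pass: reject strings that cannot be a DOI."""
    return bool(DOI_PATTERN.match(doi))

def doi_resolves(doi: str, timeout: float = 5.0) -> bool:
    """Second pass: ask doi.org whether the DOI actually resolves.
    Network call; returns False on any error."""
    req = urllib.request.Request(
        f"https://doi.org/{doi}",
        method="HEAD",
        headers={"User-Agent": "doi-check-sketch"},  # hypothetical UA
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status < 400
    except Exception:
        return False

def flag_citation(doi: Optional[str]) -> str:
    """DOI-only verdict: 'missing', 'malformed', 'broken', or 'ok'."""
    if not doi:
        return "missing"
    if not has_valid_doi_syntax(doi):
        return "malformed"
    return "ok" if doi_resolves(doi) else "broken"
```

Note this is exactly why DOI-only sits low on the leaderboard: it can only catch missing or broken identifiers, not a real DOI attached to the wrong claim.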
## Metrics
- Detection Rate (DR) — fraction of hallucinated citations correctly flagged
- F1 — harmonic mean of precision and recall on the hallucinated class
- Tier-weighted F1 (TW-F1) — per-tier F1 averaged with tier weights (Hard counts more)
- detect@k — detection rate restricted to the top-k ranked citations
- FPR — false-positive rate: fraction of genuine citations wrongly flagged
- ECE — expected calibration error of confidence scores
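The metrics above can be sketched in a few lines each. This is an illustrative implementation, not HALLMARK's evaluation code: the tier weights, the bin count for ECE, and the exact detect@k convention (top-k by model confidence, recall against all hallucinated entries) are assumptions.

```python
# labels: 1 = hallucinated, 0 = genuine; preds: 1 = flagged, 0 = passed.
TIER_WEIGHTS = {"easy": 1.0, "medium": 2.0, "hard": 3.0}  # assumed weights

def detection_rate(labels, preds):
    """Fraction of truly hallucinated citations that were flagged."""
    pos = [p for l, p in zip(labels, preds) if l]
    return sum(pos) / len(pos) if pos else 0.0

def f1(labels, preds):
    """Harmonic mean of precision and recall on the hallucinated class."""
    tp = sum(1 for l, p in zip(labels, preds) if l and p)
    fp = sum(1 for l, p in zip(labels, preds) if not l and p)
    fn = sum(1 for l, p in zip(labels, preds) if l and not p)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

def tier_weighted_f1(labels, preds, tiers):
    """Weighted average of per-tier F1 scores."""
    total = wsum = 0.0
    for tier, w in TIER_WEIGHTS.items():
        idx = [i for i, t in enumerate(tiers) if t == tier]
        if idx:
            total += w * f1([labels[i] for i in idx], [preds[i] for i in idx])
            wsum += w
    return total / wsum if wsum else 0.0

def detect_at_k(labels, scores, k):
    """Hallucinations caught among the k highest-confidence flags."""
    top = sorted(range(len(scores)), key=lambda i: -scores[i])[:k]
    total_halluc = sum(labels)
    return sum(labels[i] for i in top) / total_halluc if total_halluc else 0.0

def ece(labels, confidences, n_bins=10):
    """Expected calibration error: per-bin |accuracy - mean confidence|."""
    bins = [[] for _ in range(n_bins)]
    for l, c in zip(labels, confidences):
        bins[min(int(c * n_bins), n_bins - 1)].append((l, c))
    err = 0.0
    for b in bins:
        if b:
            acc = sum(l for l, _ in b) / len(b)
            conf = sum(c for _, c in b) / len(b)
            err += len(b) / len(labels) * abs(acc - conf)
    return err
```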
Final system rankings use Plackett–Luce ranking to aggregate across metrics and tiers into a single ordering.
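One standard way to fit Plackett–Luce strengths from a set of full orderings is the minorize–maximize (MM) update of Hunter (2004). The sketch below assumes each metric-and-tier combination contributes one complete ordering of the systems (best first); that input format, like the code itself, is an assumption about the aggregation, not HALLMARK's actual implementation.

```python
def plackett_luce(rankings, n_items, iters=200):
    """MM estimation of Plackett-Luce strengths.

    rankings: list of full orderings over item indices, best first.
    Returns strengths normalized to sum to 1; a higher strength
    means the item tends to be ranked earlier.
    """
    gamma = [1.0 / n_items] * n_items
    for _ in range(iters):
        wins = [0.0] * n_items
        denom = [0.0] * n_items
        for r in rankings:
            m = len(r)
            # Suffix sums of current strengths: tails[j] = sum over r[j:].
            tails = [0.0] * m
            tail = 0.0
            for j in range(m - 1, -1, -1):
                tail += gamma[r[j]]
                tails[j] = tail
            # Each position j < m-1 is a choice of r[j] from the set r[j:];
            # the last position carries no information (probability 1).
            for j in range(m - 1):
                wins[r[j]] += 1.0
                inv = 1.0 / tails[j]
                for k in range(j, m):
                    denom[r[k]] += inv
        gamma = [w / d if d > 0 else g
                 for w, d, g in zip(wins, denom, gamma)]
        total = sum(gamma)
        gamma = [g / total for g in gamma]
    return gamma
```

Sorting systems by descending strength then yields the single final ordering; with two items the fitted strengths reduce to the pairwise win rate.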
## Design Principles
HALLMARK draws on best practices from established code and ML benchmarks: HumanEval (functional correctness via test cases), SWE-bench (real-world task grounding), LiveCodeBench (contamination-resistant temporal splits), and ONEBench (multi-metric aggregation into a unified score).
## Leaderboard
Ranked by F1-Hallucination (F1-H), the primary metric, on the dev_public split (1,119 entries). Arrows indicate each metric's preferred direction.
| # | Baseline | Type | DR ↑ | F1-H ↑ | TW-F1 ↑ | FPR ↓ | ECE ↓ |
|---|---|---|---|---|---|---|---|
| 1 | bibtex-updater | TOOL | 0.946 | 0.908 | 0.936 | 0.179 | 0.297 |
| 2 | GPT-5.1 | LLM | 0.797 | 0.822 | 0.846 | 0.171 | 0.107 |
| 3 | Qwen 3 235B | LLM | 0.832 | 0.737 | 0.806 | 0.551 | 0.294 |
| 4 | DeepSeek R1 | LLM | 0.871 | 0.737 | 0.814 | 0.640 | 0.247 |
| 5 | Mistral Large | LLM | 0.691 | 0.731 | 0.743 | 0.258 | 0.247 |
| 6 | DeepSeek V3 | LLM | 0.880 | 0.721 | 0.805 | 0.730 | 0.331 |
| 7 | Gemini 2.5 Flash | LLM | 0.482 | 0.617 | 0.608 | 0.101 | 0.286 |
| 8 | DOI-only | TOOL | 0.256 | 0.361 | 0.314 | 0.195 | 0.143 |
| 9 | HaRC | TOOL | 0.143 | 0.250 | 0.165 | 0.002 | 0.011 |
| 10 | verify-citations | TOOL | 0.300 | 0.240 | 0.302 | 0.133 | — |