HALLMARK: Citation Hallucination Benchmark

Measuring how well tools detect hallucinated citations in ML papers

HALLMARK is a citation hallucination benchmark for evaluating reference-verification tools on ML papers. It provides 2,525 annotated entries spanning 14 hallucination types across 3 difficulty tiers, with 6 sub-tests per entry for fine-grained diagnosis.
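
The release format of an individual entry is not shown in this overview; the sketch below is purely illustrative, assuming a JSON-like record per citation, and every field name and value in it is hypothetical rather than taken from the actual HALLMARK schema.

```python
# Hypothetical entry layout -- all field names, the type label, and the
# sub-test names are illustrative, not the actual HALLMARK schema.
example_entry = {
    "entry_id": "hallmark-000001",
    "claimed_citation": "Doe et al., 'A Fictitious Transformer', NeurIPS 2021",
    "is_hallucinated": True,
    "hallucination_type": "nonexistent_doi",   # one of the 14 types
    "tier": "easy",                            # easy / medium / hard
    "sub_tests": {                             # 6 fine-grained checks per entry
        "title_exists": False,
        "authors_match": False,
        "venue_matches": False,
        "year_matches": False,
        "doi_resolves": False,
        "claim_supported": False,
    },
}
```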

Difficulty Tiers

Tiers reflect how difficult a hallucination is to detect: Easy entries have obvious errors (e.g., nonexistent DOIs), Medium entries require cross-referencing metadata, and Hard entries involve subtle misattributions that demand deep semantic understanding of the cited work.

Built-in Baselines

The baselines shipped with the benchmark span two types: dedicated reference-verification tools (bibtex-updater, DOI-only, HaRC, verify-citations) and general-purpose LLMs. Results for all of them appear on the leaderboard below.

Metrics

Systems are scored on the five metrics shown on the leaderboard: DR, F1-H (F1-Hallucination, the primary metric), TW-F1, FPR, and ECE; for FPR and ECE, lower is better. Final system rankings use Plackett–Luce aggregation to combine the per-metric, per-tier orderings into a single overall ordering.
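
The exact aggregation code is not given here; the following is a minimal sketch of Plackett–Luce worth estimation via Hunter's (2004) MM updates, assuming each metric/tier combination contributes one complete ordering of the systems. Function and variable names are illustrative, not part of any HALLMARK API.

```python
import numpy as np

def plackett_luce_worths(rankings, n_items, n_iter=500, tol=1e-9):
    """Estimate Plackett-Luce worth parameters from full rankings.

    rankings: list of lists of item indices, each ordered best-to-worst
              (e.g. one ranking per metric/tier combination).
    Returns a worth vector w; sorting items by descending w yields the
    aggregated ordering.
    """
    w = np.full(n_items, 1.0 / n_items)
    # wins[i] = number of stages, across all rankings, at which item i is chosen
    wins = np.zeros(n_items)
    for r in rankings:
        for item in r[:-1]:            # the last-ranked item wins no stage
            wins[item] += 1

    for _ in range(n_iter):
        denom = np.zeros(n_items)
        for r in rankings:
            idx = np.array(r)
            # tail[j] = total worth of items still in contention at stage j
            tail = np.cumsum(w[idx[::-1]])[::-1]
            for j in range(len(r) - 1):
                denom[idx[j:]] += 1.0 / max(tail[j], 1e-12)
        w_new = wins / np.maximum(denom, 1e-12)
        w_new /= w_new.sum()
        if np.max(np.abs(w_new - w)) < tol:
            return w_new
        w = w_new
    return w

# Toy usage: three orderings of four systems (index listed first = best).
orderings = [[0, 1, 2, 3], [1, 0, 2, 3], [0, 2, 1, 3]]
worths = plackett_luce_worths(orderings, n_items=4)
print(np.argsort(-worths))   # aggregated ranking, best system first
```

Sorting by the fitted worths yields a single ordering that rewards systems ranked consistently high across metrics and tiers, rather than averaging raw metric values that live on different scales.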

Design Principles

HALLMARK draws on best practices from established code and ML benchmarks: HumanEval (functional correctness via test cases), SWE-bench (real-world task grounding), LiveCodeBench (contamination-resistant temporal splits), and ONEBench (multi-metric aggregation into a unified score).

Leaderboard

Ranked by F1-Hallucination (the primary metric) on the dev_public split (1,119 entries). Arrows in the header indicate the preferred direction for each metric; a minimal sketch of the metric computations follows the table.

# Baseline Type DR ↑ F1-H ↑ TW-F1 ↑ FPR ↓ ECE ↓
1 bibtex-updater TOOL 0.946 0.908 0.936 0.179 0.297
2 GPT-5.1 LLM 0.797 0.822 0.846 0.171 0.107
3 Qwen 3 235B LLM 0.832 0.737 0.806 0.551 0.294
4 DeepSeek R1 LLM 0.871 0.737 0.814 0.640 0.247
5 Mistral Large LLM 0.691 0.731 0.743 0.258 0.247
6 DeepSeek V3 LLM 0.880 0.721 0.805 0.730 0.331
7 Gemini 2.5 Flash LLM 0.482 0.617 0.608 0.101 0.286
8 DOI-only TOOL 0.256 0.361 0.314 0.195 0.143
9 HaRC TOOL 0.143 0.250 0.165 0.002 0.011
10 verify-citations TOOL 0.300 0.240 0.302 0.133
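
The overview does not spell out how the individual metrics are computed; the sketch below assumes the standard definitions with "hallucinated" as the positive class (DR is assumed to be recall over hallucinated entries, and tier weighting for TW-F1 is omitted), so the benchmark's exact formulas may differ.

```python
def score_predictions(y_true, y_pred):
    """Score a system's per-entry decisions, with 'hallucinated' as the positive class.

    y_true, y_pred: equal-length sequences of booleans
                    (True = entry is / is flagged as hallucinated).
    Returns (dr, f1_h, fpr) under the usual definitions.
    """
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    fn = sum(t and not p for t, p in zip(y_true, y_pred))
    tn = sum((not t) and not p for t, p in zip(y_true, y_pred))

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    dr = tp / (tp + fn) if (tp + fn) else 0.0               # recall on hallucinated entries
    f1_h = (2 * precision * dr / (precision + dr)
            if (precision + dr) else 0.0)
    fpr = fp / (fp + tn) if (fp + tn) else 0.0               # genuine citations wrongly flagged
    return dr, f1_h, fpr
```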