HALLMARK is a citation hallucination benchmark for evaluating reference-verification tools on ML papers. It provides 2,525 annotated entries spanning 14 hallucination types across 3 difficulty tiers, with 6 sub-tests per entry for fine-grained diagnosis.
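Concretely, one annotated entry might be represented as below. This is a minimal sketch: the field names and the idea of a `doi` attribute are illustrative assumptions, not HALLMARK's actual schema.

```python
from dataclasses import dataclass, field

# Illustrative schema only; field and sub-test names are assumptions,
# not HALLMARK's actual on-disk format.
@dataclass
class HallmarkEntry:
    entry_id: str
    bibtex: str                   # the citation as it appears in the paper
    doi: str | None               # DOI string, if the entry carries one
    tier: str                     # "easy" | "medium" | "hard"
    hallucination_type: str       # one of the 14 annotated types
    is_hallucinated: bool         # ground-truth label
    subtests: dict[str, bool] = field(default_factory=dict)  # 6 pass/fail labels
```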
## Difficulty Tiers
Tiers reflect how difficult a hallucination is to detect: Easy entries have obvious errors (e.g., nonexistent DOIs), Medium entries require cross-referencing metadata, and Hard entries involve subtle misattributions that demand deep semantic understanding of the cited work.
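To make the tiers concrete, here are invented examples matching each definition; none of these are actual HALLMARK entries.

```python
# Invented examples only, one per tier; not actual HALLMARK entries.
tier_examples = {
    "easy": {
        # The DOI prefix is unregistered, so resolution fails immediately.
        "citation": "Smith et al., 2021. Deep Nets. doi:10.9999/fake.123",
        "error": "nonexistent DOI",
    },
    "medium": {
        # Title and authors are real, but venue and year do not match the
        # actual record, so metadata cross-referencing is required.
        "citation": "Vaswani et al., 2019. Attention Is All You Need. ICML.",
        "error": "wrong venue and year (actually NeurIPS 2017)",
    },
    "hard": {
        # Bibliographic metadata is perfectly valid; the paper simply does
        # not support the claim it is cited for.
        "citation": "He et al., 2016. Deep Residual Learning for Image "
                    "Recognition. CVPR.",
        "error": "misattributed claim; requires reading the cited work",
    },
}
```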
## Built-in Baselines
- DOI-only — resolves DOIs and flags missing or broken links (a minimal sketch follows this list)
- bibtex-updater — metadata-based verification via BibTeX normalization
- LLM-based — direct LLM prompting for citation plausibility
- Ensemble — combines DOI + metadata + LLM signals
- HaRC — hallucinated reference classification baseline
- verify-citations — end-to-end citation verification pipeline
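As a sketch of how the simplest of these could work, below is a minimal DOI-only checker. It assumes the illustrative `HallmarkEntry` schema from earlier; the confidence values are arbitrary, and doi.org's response to HEAD requests can vary by registrar, so treat this as an approximation rather than the benchmark's actual baseline.

```python
import urllib.error
import urllib.request

def doi_resolves(doi: str, timeout: float = 10.0) -> bool:
    """True if the DOI resolves at doi.org; unregistered or broken DOIs
    return an HTTP error (typically 404) from the resolver."""
    req = urllib.request.Request(
        f"https://doi.org/{doi}",
        method="HEAD",
        headers={"User-Agent": "hallmark-doi-check/0.1"},
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status < 400
    except (urllib.error.HTTPError, urllib.error.URLError):
        return False  # dead DOI, or a network failure treated conservatively

def doi_only_baseline(entry: "HallmarkEntry") -> tuple[bool, float]:
    """Flag an entry as hallucinated if it lacks a DOI or has a dead one.
    Returns (is_hallucinated, confidence); confidences are illustrative."""
    if not entry.doi:
        return True, 0.5            # a missing DOI is only weak evidence
    return not doi_resolves(entry.doi), 0.9
```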
## Metrics
- Detection Rate — fraction of hallucinated citations correctly flagged
- F1 — harmonic mean of precision and recall
- Tier-weighted F1 — F1 weighted by difficulty tier (Hard counts more)
- detect@k — detection rate within the top-k ranked citations
- FPR — false positive rate: fraction of legitimate citations incorrectly flagged (reported on the leaderboard below)
- ECE — expected calibration error of confidence scores
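A minimal sketch of these computations, assuming binary hallucination labels, per-entry confidence scores, and illustrative tier weights; HALLMARK's actual tier weighting and its exact detect@k definition are not specified here, so both are assumptions.

```python
import numpy as np

# Illustrative tier weights ("Hard counts more"); the benchmark's actual
# weighting scheme is an assumption here.
TIER_WEIGHTS = {"easy": 1.0, "medium": 2.0, "hard": 3.0}

def detection_rate(y_true, y_pred):
    """Fraction of truly hallucinated citations (y_true == 1) flagged."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return y_pred[y_true == 1].mean()

def f1(y_true, y_pred):
    """Harmonic mean of precision and recall on the hallucinated class."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = int(((y_true == 1) & (y_pred == 1)).sum())
    fp = int(((y_true == 0) & (y_pred == 1)).sum())
    fn = int(((y_true == 1) & (y_pred == 0)).sum())
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def tier_weighted_f1(y_true, y_pred, tiers):
    """Weighted mean of per-tier F1 scores."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    scores, weights = [], []
    for tier, w in TIER_WEIGHTS.items():
        mask = np.asarray([t == tier for t in tiers])
        if mask.any():
            scores.append(f1(y_true[mask], y_pred[mask]))
            weights.append(w)
    return float(np.average(scores, weights=weights))

def detect_at_k(y_true, scores, k):
    """One plausible reading of detect@k: the fraction of all hallucinated
    citations recovered among the k highest-scoring predictions."""
    y_true, scores = np.asarray(y_true), np.asarray(scores)
    top_k = np.argsort(scores)[::-1][:k]
    return y_true[top_k].sum() / y_true.sum()

def ece(y_true, conf, n_bins=10):
    """Expected calibration error: per-bin gap between mean predicted
    probability of hallucination and the empirical positive rate."""
    y_true, conf = np.asarray(y_true, dtype=float), np.asarray(conf)
    bins = np.minimum((conf * n_bins).astype(int), n_bins - 1)
    return sum(
        (bins == b).mean() * abs(y_true[bins == b].mean() - conf[bins == b].mean())
        for b in range(n_bins) if (bins == b).any()
    )
```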
Final system rankings are produced with a Plackett–Luce model, which aggregates the per-metric, per-tier orderings into a single overall ordering.
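One standard way to realize this aggregation is Hunter's (2004) MM algorithm for Plackett–Luce maximum likelihood, sketched below with each metric's ordering of systems treated as one observed ranking. The fitting procedure itself is standard; its use here as HALLMARK's exact aggregation recipe is an assumption.

```python
import numpy as np

def fit_plackett_luce(rankings, n_items, n_iter=500, tol=1e-9):
    """Fit Plackett-Luce worth parameters with Hunter's MM updates.
    `rankings` is a list of full orderings over item ids, best first."""
    gamma = np.ones(n_items)
    for _ in range(n_iter):
        wins = np.zeros(n_items)    # stages won by each item
        denom = np.zeros(n_items)   # accumulated MM denominators
        for r in rankings:
            r = list(r)
            for t in range(len(r) - 1):      # the last item wins no stage
                rest = r[t:]                 # items still in contention
                wins[r[t]] += 1
                denom[rest] += 1.0 / gamma[rest].sum()
        new = wins / np.maximum(denom, 1e-300)
        new /= new.sum()                     # fix the scale (identifiability)
        if np.max(np.abs(new - gamma)) < tol:
            return new
        gamma = new
    return gamma

# Example: aggregate three per-metric orderings of four systems (ids 0-3).
rankings = [[0, 1, 2, 3], [1, 0, 2, 3], [0, 2, 1, 3]]
worth = fit_plackett_luce(rankings, n_items=4)
final_order = np.argsort(worth)[::-1]        # best system first
```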
## Design Principles
HALLMARK draws on best practices from established code and ML benchmarks: HumanEval (functional correctness via test cases), SWE-bench (real-world task grounding), LiveCodeBench (contamination-resistant temporal splits), and ONEBench (multi-metric aggregation into a unified score).
## Leaderboard
Ranked by F1-Hallucination (primary metric) on the test_public split (831 entries).
Arrows indicate the preferred direction; column abbreviations: DR = Detection Rate, F1-H = F1-Hallucination, TW-F1 = Tier-weighted F1, FPR = False Positive Rate, ECE = Expected Calibration Error.
| # | Baseline | Type | DR ↑ | F1-H ↑ | TW-F1 ↑ | FPR ↓ | ECE ↓ |
|---|---|---|---|---|---|---|---|
| 1 | Claude Sonnet 4.6 + BibTeX-Updater (agentic) | LLM | 0.965 | 0.901 | 0.937 | 0.334 | 0.071 |
| 2 | GPT-5.1 + BibTeX-Updater (agentic) | LLM | 0.938 | 0.885 | 0.921 | 0.345 | 0.054 |
| 3 | bibtex-updater | TOOL | 0.886 | 0.858 | 0.892 | 0.338 | 0.265 |
| 4 | GPT-5.1 (agentic) | LLM | 0.943 | 0.850 | 0.903 | 0.523 | 0.137 |
| 5 | GPT-5.1 (tool-augmented) | LLM | 0.829 | 0.845 | 0.862 | 0.254 | 0.104 |
| 6 | Claude Opus 4.7 | LLM | 0.735 | 0.832 | 0.851 | 0.059 | 0.126 |
| 7 | Qwen 3 Max | LLM | 0.901 | 0.803 | 0.862 | 0.652 | 0.247 |
| 8 | DeepSeek R1 | LLM | 0.781 | 0.802 | 0.820 | 0.330 | 0.202 |
| 9 | Qwen 3 235B | LLM | 0.888 | 0.799 | 0.859 | 0.631 | 0.249 |
| 10 | Mistral Large | LLM | 0.658 | 0.724 | 0.732 | 0.303 | 0.289 |
| 11 | Llama 4 Maverick | LLM | 0.604 | 0.711 | 0.700 | 0.178 | 0.207 |
| 12 | Gemini 2.5 Flash | LLM | 0.483 | 0.627 | 0.606 | 0.111 | 0.335 |
| 13 | Gemini 2.5 Pro | LLM | 0.437 | 0.594 | 0.582 | 0.064 | 0.382 |
| 14 | DOI-only | TOOL | 0.369 | 0.483 | 0.436 | 0.303 | 0.148 |