HALLMARK: Citation Hallucination Benchmark

Measuring how well tools detect hallucinated citations in ML papers

HALLMARK is a citation hallucination benchmark for evaluating reference-verification tools on ML papers. It provides 2,525 annotated entries spanning 14 hallucination types across 3 difficulty tiers, with 6 sub-tests per entry for fine-grained diagnosis.

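For concreteness, here is a minimal sketch of how one might iterate over the benchmark, assuming a hypothetical JSONL layout in which each entry carries a citation record, a ground-truth label, one of the 14 hallucination types, a difficulty tier, and six sub-test annotations. All file and field names below are illustrative, not the published HALLMARK schema.

```python
import json

# Hypothetical JSONL layout -- file and field names are illustrative,
# not the published HALLMARK schema.
def load_entries(path):
    """Yield one annotated entry per line of a JSONL file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)

for entry in load_entries("hallmark_test_public.jsonl"):
    # Each entry: a citation record, a binary label, one of the 14
    # hallucination types, a difficulty tier, and 6 sub-test annotations.
    print(entry["tier"], entry["hallucination_type"], len(entry["subtests"]))
```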

Difficulty Tiers

Tiers reflect how difficult a hallucination is to detect: Easy entries have obvious errors (e.g., nonexistent DOIs), Medium entries require cross-referencing metadata, and Hard entries involve subtle misattributions that demand deep semantic understanding of the cited work.

Built-in Baselines

HALLMARK ships with 14 built-in baselines, spanning dedicated tools (bibtex-updater and a DOI-only checker) and LLM detectors run plain, tool-augmented, or agentic; all 14 appear on the leaderboard below.

Metrics

Each system is scored on five metrics: DR, F1-Hallucination (F1-H, the primary metric), TW-F1, false positive rate (FPR), and expected calibration error (ECE). Final system rankings use a Plackett–Luce model to aggregate across metrics and tiers into a single ordering.
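The aggregation specifics (how each per-metric, per-tier ranking is formed, and how ties are handled) aren't spelled out here, but a standard way to fit a Plackett–Luce model to a set of rankings is Hunter's (2004) MM algorithm. The sketch below is a minimal implementation, assuming each metric × tier combination contributes one full ranking of the systems; function and variable names are our own.

```python
import numpy as np

def plackett_luce_mm(rankings, n_items, iters=500, tol=1e-9):
    """Fit Plackett-Luce worths to full rankings via the MM algorithm
    (Hunter, 2004). Each ranking lists item ids 0..n_items-1, best first;
    every item is assumed to appear in at least one ranking."""
    w = np.ones(n_items)           # initial worths
    W = np.zeros(n_items)          # times each item is ranked above someone
    for r in rankings:
        for i in r[:-1]:
            W[i] += 1
    for _ in range(iters):
        denom = np.zeros(n_items)
        for r in rankings:
            # tail[j] = total worth of the items still available at stage j
            tail = np.cumsum(w[list(r)][::-1])[::-1]
            for j, stage_sum in enumerate(tail[:-1]):
                for i in r[j:]:    # every item still in contention at stage j
                    denom[i] += 1.0 / stage_sum
        w_new = W / denom
        w_new /= w_new.sum()       # worths are identified only up to scale
        if np.max(np.abs(w_new - w)) < tol:
            return w_new
        w = w_new
    return w

# Toy example: rankings of four systems induced by three metrics (best first).
rankings = [(0, 1, 2, 3), (1, 0, 3, 2), (0, 2, 1, 3)]
worths = plackett_luce_mm(rankings, n_items=4)
print(np.argsort(-worths))         # aggregate ordering, best system first
```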

Design Principles

HALLMARK draws on best practices from established code and ML benchmarks: HumanEval (functional correctness via test cases), SWE-bench (real-world task grounding), LiveCodeBench (contamination-resistant temporal splits), and ONEBench (multi-metric aggregation into a unified score).
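As one illustration, a LiveCodeBench-style temporal split can be expressed in a few lines: entries drawn from papers published after a model's training cutoff form a contamination-resistant evaluation slice. The field name and cutoff date below are hypothetical.

```python
from datetime import date

# Hypothetical field name and cutoff; illustrates the temporal-split idea only.
CUTOFF = date(2025, 1, 1)

def temporal_split(entries):
    """Separate entries into pre- and post-cutoff slices by source-paper date."""
    pre = [e for e in entries if date.fromisoformat(e["paper_date"]) < CUTOFF]
    post = [e for e in entries if date.fromisoformat(e["paper_date"]) >= CUTOFF]
    return pre, post
```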

Leaderboard

Ranked by F1-Hallucination (F1-H, the primary metric) on the test_public split (831 entries). Arrows indicate each metric's preferred direction; a reference computation for the metric columns is sketched below the table.

| # | Baseline | Type | DR ↑ | F1-H ↑ | TW-F1 ↑ | FPR ↓ | ECE ↓ |
|---|----------|------|------|--------|---------|-------|-------|
| 1 | Claude Sonnet 4.6 + BibTeX-Updater (agentic) | LLM | 0.965 | 0.901 | 0.937 | 0.334 | 0.071 |
| 2 | GPT-5.1 + BibTeX-Updater (agentic) | LLM | 0.938 | 0.885 | 0.921 | 0.345 | 0.054 |
| 3 | bibtex-updater | TOOL | 0.886 | 0.858 | 0.892 | 0.338 | 0.265 |
| 4 | GPT-5.1 (agentic) | LLM | 0.943 | 0.850 | 0.903 | 0.523 | 0.137 |
| 5 | GPT-5.1 (tool-augmented) | LLM | 0.829 | 0.845 | 0.862 | 0.254 | 0.104 |
| 6 | Claude Opus 4.7 | LLM | 0.735 | 0.832 | 0.851 | 0.059 | 0.126 |
| 7 | Qwen 3 Max | LLM | 0.901 | 0.803 | 0.862 | 0.652 | 0.247 |
| 8 | DeepSeek R1 | LLM | 0.781 | 0.802 | 0.820 | 0.330 | 0.202 |
| 9 | Qwen 3 235B | LLM | 0.888 | 0.799 | 0.859 | 0.631 | 0.249 |
| 10 | Mistral Large | LLM | 0.658 | 0.724 | 0.732 | 0.303 | 0.289 |
| 11 | Llama 4 Maverick | LLM | 0.604 | 0.711 | 0.700 | 0.178 | 0.207 |
| 12 | Gemini 2.5 Flash | LLM | 0.483 | 0.627 | 0.606 | 0.111 | 0.335 |
| 13 | Gemini 2.5 Pro | LLM | 0.437 | 0.594 | 0.582 | 0.064 | 0.382 |
| 14 | DOI-only | TOOL | 0.369 | 0.483 | 0.436 | 0.303 | 0.148 |
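For reference, the sketch below computes the flavor of metrics reported above from binary predictions, treating "hallucinated" as the positive class. FPR and ECE follow their standard definitions; reading DR as recall on hallucinated entries is our assumption, and TW-F1 (presumably a tier-weighted F1) is omitted because its weighting scheme isn't given here.

```python
import numpy as np

def leaderboard_metrics(y_true, y_pred, confidence, n_bins=10):
    """Compute DR, F1-H, FPR, and ECE from binary predictions
    (1 = hallucinated citation, 0 = genuine). DR is read here as recall
    on the hallucinated class -- an assumption, not HALLMARK's definition."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    conf = np.asarray(confidence, dtype=float)

    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))

    recall = tp / (tp + fn) if tp + fn else 0.0   # assumed DR
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1_h = (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)
    fpr = fp / (fp + tn) if fp + tn else 0.0      # genuine citations wrongly flagged

    # Expected calibration error: bucket predictions by confidence and compare
    # each bucket's mean confidence to its empirical accuracy.
    correct = (y_pred == y_true).astype(float)
    bin_ids = np.minimum((conf * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            ece += mask.mean() * abs(conf[mask].mean() - correct[mask].mean())

    return {"DR": recall, "F1-H": f1_h, "FPR": fpr, "ECE": ece}
```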