Published benchmark
Trust should be measured.
Verifex publishes benchmark methodology for sanctions matching, transliteration, aliases, false positives, and edge cases. Every result is inspectable.
F1 Score: 96.6% (on our published benchmark)
Recall: 100% (0 false negatives)
Precision: 93.4% (6 false positives)
p50 Latency: 45ms (p99 276ms)
Verified May 2, 2026 · 122-case published benchmark: 96.6% F1, 100% recall, 93.4% precision.
Reflects the v2 production engine. v3 runs in shadow-only mode and is not yet used for live decisions.
Results at a glance
What held up. What needs work.
122 cases across 15 categories. 13 categories passed cleanly. 2 categories show room for improvement.
Passed cleanly (13/15)
Official-list names matched cleanly.
Typos and minor spelling variants held up.
Cross-script transliteration remained strong.
German/French transliterations all matched.
Phonetic variants remained intact.
Reordered names still resolved.
Entity aliases and acronyms held.
PEP coverage passed cleanly.
Cross-jurisdiction matches stayed intact.
Distinctive surname-only lookups passed.
False substring matches were suppressed.
Invented company names cleared correctly.
Phase 1 FP reduction: all common-name cases now clear correctly with surname-mismatch and evidence-gate penalties.
Partial pass (2/15)
Two adversarial inputs produce low-confidence PEP near-matches (conf 42, AUTO_CLEAR). Not sanctions matches.
Four invented-name cases produce low-confidence PEP near-matches (conf 42, AUTO_CLEAR). These are intentionally tolerated; see the False-positive cluster section.
Example cases
How the pipeline handles real inputs.
Vladimir Putin
Exact name on OFAC SDN list. High confidence, primary sanctions source.
بوتين
Arabic script transliteration matched to Latin-script OFAC entry via cross-script pipeline.
V. Putin
Initial + surname matched via alias expansion and token reordering.
John Smith
Common name with no distinguishing fields. Correctly cleared with low confidence. No review required.
Myrtalee Quennford
Phonetic near-match to PEP entry. AUTO_CLEAR with human_review_required: false. Intentionally tolerated to protect recall.
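To make the "V. Putin" case above concrete, here is a minimal sketch of initial-aware, order-insensitive token matching. It is an illustration only, not Verifex's alias-expansion logic: the function names `tokens`, `token_matches`, and `initials_compatible` are hypothetical, and the rule that at least one full token must match exactly is an assumption added to keep the example conservative.

```python
from itertools import permutations


def tokens(name: str) -> list[str]:
    """Split a name into casefolded tokens, dropping trailing dots and commas."""
    cleaned = (part.strip(".,").casefold() for part in name.split())
    return [tok for tok in cleaned if tok]


def token_matches(a: str, b: str) -> bool:
    """Tokens match if identical, or if one is a single-letter initial of the other."""
    if a == b:
        return True
    short, long_ = sorted((a, b), key=len)
    return len(short) == 1 and long_.startswith(short)


def initials_compatible(query: str, listed: str) -> bool:
    """Order-insensitive comparison that lets an initial stand in for a full token.

    Assumption (not from the page): at least one token longer than one
    character must match exactly, so "V. P." alone never matches.
    """
    q, l = tokens(query), tokens(listed)
    if not q or len(q) != len(l):
        return False
    for perm in permutations(l):  # fine for short names; factorial in token count
        pairs = list(zip(q, perm))
        if all(token_matches(a, b) for a, b in pairs) and any(
            a == b and len(a) > 1 for a, b in pairs
        ):
            return True
    return False


print(initials_compatible("V. Putin", "Vladimir Putin"))
print(initials_compatible("J. Smith", "Vladimir Putin"))
```

A production matcher would combine a check like this with the other scoring signals rather than treat it as a match on its own.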
False-positive cluster
Borderline cases we publish, not hide.
Six borderline cases remain after Phase 2 FP reduction (May 2026). All six are low-confidence PEP near-matches at exactly confidence 42, the minimum match threshold. None is a sanctions block recommendation: the adjudication engine marks all six as AUTO_CLEAR with human_review_required: false. Suppressing them would risk false negatives on real phonetic PEP matches.
PEP phonetic near-match (Zandra Maulen Jofre). AUTO_CLEAR.
PEP fuzzy near-match (Thoralf Heimdal). AUTO_CLEAR.
PEP phonetic near-match (Myrtle Cole). AUTO_CLEAR.
PEP phonetic near-match (Freitas Junior). AUTO_CLEAR.
PEP phonetic near-match (Mike Testa). AUTO_CLEAR.
PEP fuzzy near-match (Héctor Pérez Plazola). AUTO_CLEAR.
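The sketch below shows one way a client could triage the six cases listed above. Only the names, the match types, the confidence value 42, the AUTO_CLEAR label, and the `human_review_required` flag come from this page. The `ScreenResult` shape, its other field names, and the helper functions are hypothetical and do not describe the actual API response schema.

```python
from dataclasses import dataclass

MIN_MATCH_CONFIDENCE = 42  # minimum match threshold stated on this page


@dataclass(frozen=True)
class ScreenResult:
    """Hypothetical result record; field names other than
    human_review_required are assumptions for illustration."""
    matched_name: str
    match_type: str            # "phonetic" or "fuzzy"
    confidence: int
    disposition: str           # e.g. "AUTO_CLEAR"
    human_review_required: bool


def needs_analyst(result: ScreenResult) -> bool:
    """Escalate when the engine asks for review or does not auto-clear."""
    return result.human_review_required or result.disposition != "AUTO_CLEAR"


def is_borderline(result: ScreenResult) -> bool:
    """A tolerated near-match: exactly at the threshold and auto-cleared."""
    return result.confidence == MIN_MATCH_CONFIDENCE and not needs_analyst(result)


# The six published borderline cases, transcribed from the list above.
FP_CLUSTER = [
    ScreenResult("Zandra Maulen Jofre", "phonetic", 42, "AUTO_CLEAR", False),
    ScreenResult("Thoralf Heimdal", "fuzzy", 42, "AUTO_CLEAR", False),
    ScreenResult("Myrtle Cole", "phonetic", 42, "AUTO_CLEAR", False),
    ScreenResult("Freitas Junior", "phonetic", 42, "AUTO_CLEAR", False),
    ScreenResult("Mike Testa", "phonetic", 42, "AUTO_CLEAR", False),
    ScreenResult("Héctor Pérez Plazola", "fuzzy", 42, "AUTO_CLEAR", False),
]

escalated = [r.matched_name for r in FP_CLUSTER if needs_analyst(r)]
print(f"{len(FP_CLUSTER)} borderline cases, {len(escalated)} sent to an analyst")
```

Under these assumptions none of the six reaches an analyst queue, which is why they count against precision without creating review workload.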
Pipeline
What runs behind every screen.
Normalize & Generate
Unicode normalization, transliteration handling, lexical search, and semantic candidate generation across the configured active screening index.
Multi-algorithm Score
Exact, fuzzy, phonetic, and contextual scoring work together instead of relying on a single similarity metric.
FP Suppression
Penalty layers reduce business-name traps, entity/person mismatches, and noisy token collisions.
Structured Adjudicate
Ambiguous cases receive structured verdicts (BLOCK / REVIEW / CLEAR) with documented rationale, not opaque scores.
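The four stages above can be sketched end to end in a few lines. This is a toy illustration under stated assumptions, not the Verifex engine: it uses only the Python standard library, covers exact and fuzzy scoring but omits phonetic, semantic, and transliteration steps, and every threshold and penalty value (`BLOCK_AT`, `REVIEW_AT`, `SURNAME_PENALTY`) is hypothetical.

```python
import unicodedata
from difflib import SequenceMatcher

# Hypothetical values chosen for illustration; not published by Verifex.
BLOCK_AT = 90.0
REVIEW_AT = 60.0
SURNAME_PENALTY = 30.0


def normalize(name: str) -> str:
    """Stage 1: Unicode-normalize, strip diacritics, casefold, collapse spaces."""
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(stripped.casefold().split())


def score(query: str, candidate: str) -> float:
    """Stage 2: exact match first, then fuzzy similarity on sorted tokens."""
    q, c = normalize(query), normalize(candidate)
    if q == c:
        return 100.0
    q_sorted = " ".join(sorted(q.split()))  # sorting makes reordered names equal
    c_sorted = " ".join(sorted(c.split()))
    return 100.0 * SequenceMatcher(None, q_sorted, c_sorted).ratio()


def apply_penalties(raw: float, query: str, candidate: str) -> float:
    """Stage 3: subtract a penalty when the query's last token is absent."""
    q_tokens = normalize(query).split()
    c_tokens = normalize(candidate).split()
    if not q_tokens or q_tokens[-1] not in c_tokens:
        raw -= SURNAME_PENALTY
    return max(raw, 0.0)


def adjudicate(confidence: float) -> dict:
    """Stage 4: map a confidence value to a verdict with a stated rationale."""
    if confidence >= BLOCK_AT:
        verdict = "BLOCK"
    elif confidence >= REVIEW_AT:
        verdict = "REVIEW"
    else:
        verdict = "CLEAR"
    return {
        "verdict": verdict,
        "confidence": round(confidence, 1),
        "rationale": f"confidence {confidence:.1f} vs thresholds "
                     f"{REVIEW_AT:.0f}/{BLOCK_AT:.0f}",
    }


def screen(query: str, candidate: str) -> dict:
    return adjudicate(apply_penalties(score(query, candidate), query, candidate))


print(screen("Putin Vladimir", "Vladimir Putin")["verdict"])
print(screen("John Smith", "Vladimir Putin")["verdict"])
```

Returning the rationale alongside the verdict mirrors the point made above: a reviewer should see why a case landed where it did, not just a number.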
Methodology
How to interpret the numbers.
Recall is the priority metric: zero false negatives on the verified run. A missed sanctioned entity is a compliance failure. Precision is important, but never at the cost of recall.
Phase 1 FP reduction (May 2026): surname-mismatch penalty, common-name evidence gate, and stable sort eliminated all common-name false positives.
Phase 2 FP reduction (May 2026): entity-name FPs eliminated. Generic business names (Global Trading LLC, Eastern Star Consulting) now correctly clear via Strategy 3 gate and Penalty 14.
The remaining 6 FPs are intentionally tolerated borderline cases: all are low-confidence PEP near-matches at confidence 42 that the API already marks as AUTO_CLEAR with human_review_required: false. Suppressing them risks recall loss on real phonetic PEP variants.
Latency is acceptable for production: p50 45ms, p95 202ms, p99 276ms on the clean rerun.
The value of publishing this page is transparency, not claiming a perfect score. These numbers will change as usage data informs further tuning.
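The headline figures follow from the standard definitions of precision, recall, and F1, which the sketch below works through. The counts of 6 false positives, 0 false negatives, and 122 total cases are from this page. The true-positive count of 85 is not published here; it is an inference, being the only whole number that yields 93.4% precision with 6 false positives. The true-negative count further assumes each case produces exactly one outcome.

```python
def precision(tp: int, fp: int) -> float:
    """Share of flagged cases that were real matches."""
    return tp / (tp + fp)


def recall(tp: int, fn: int) -> float:
    """Share of real matches that were flagged."""
    return tp / (tp + fn)


def f1(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall, in count form."""
    return 2 * tp / (2 * tp + fp + fn)


TOTAL_CASES = 122   # from the page
FP = 6              # from the page
FN = 0              # from the page
TP = 85             # inferred, not published: 85 / (85 + 6) rounds to 93.4%
TN = TOTAL_CASES - TP - FP - FN  # assumes one outcome per case

print(f"precision = {precision(TP, FP):.1%}")   # precision = 93.4%
print(f"recall    = {recall(TP, FN):.1%}")      # recall    = 100.0%
print(f"F1        = {f1(TP, FP, FN):.1%}")      # F1        = 96.6%
```

Because recall is already 100%, any further F1 gain on this benchmark can only come from removing false positives, which is the trade-off the tolerated borderline cases represent.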
Context
How this compares to published research.
These numbers are useful for context, but the datasets are not directly comparable. The point of this page is transparency, not pretending one benchmark can replace every other one.
| System | F1 | Recall | Precision | Notes |
|---|---|---|---|---|
| Verifex (published benchmark) | 96.6% | 100% | 93.4% | 122 sanctions-specific cases. Self-administered. |
| GPT-4o (Published Pairs Dataset) | 98.95% | — | — | 755K labeled pairs, different dataset |
| DeepSeek-R1 14B (Published Pairs Dataset) | 98.23% | — | — | 755K labeled pairs, different dataset |
| Rule-based baseline (Published Pairs Dataset) | 91.33% | — | — | Published fuzzy baseline |
Transparency
Why benchmark transparency matters.
Auditable claims
Every accuracy claim on this page is tied to a reproducible test case. No marketing metrics without source data.
Failure disclosure
We publish the exact queries that produce false positives, not just the headline score. Compliance teams need to know edge cases.
Continuous improvement
The benchmark is rerun after every major pipeline change. Scores that go down are published alongside scores that go up.