Published benchmark
Trust should be measured.
Verifex publishes benchmark methodology for sanctions matching, calibrated confidence scoring, transliteration, aliases, false positives, and edge cases. Engine v3 benchmark results are published below; production screening uses the validated v3 pipeline as its primary engine.
Probabilistic scoring. Calibrated confidence.
Log-odds aggregation with Fellegi-Sunter contextual weighting. Temperature-scaled sigmoid (T = 0.20). Context-gating penalties. Every score is a probability, not a heuristic.
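As a rough illustration of the scoring idea, the sketch below sums Fellegi-Sunter style log-odds weights and passes the total through a sigmoid scaled by T = 0.20. The function names and the m/u probabilities are hypothetical and chosen only for illustration; they are not Verifex's actual weights or code.

```python
import math

T = 0.20  # temperature quoted on this page


def fs_weight(m: float, u: float, agrees: bool) -> float:
    """Fellegi-Sunter log-odds weight for one compared field.

    m = P(field agrees | true match), u = P(field agrees | non-match).
    """
    if agrees:
        return math.log(m / u)
    return math.log((1.0 - m) / (1.0 - u))


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def calibrated_probability(weights, temperature: float = T) -> float:
    """Aggregate log-odds evidence, then apply the temperature-scaled sigmoid."""
    z = sum(weights)
    return sigmoid(z / temperature)


# Hypothetical m/u values: the name agrees, the date of birth disagrees.
evidence = [
    fs_weight(0.9, 0.1, agrees=True),
    fs_weight(0.8, 0.2, agrees=False),
]
probability = calibrated_probability(evidence)
```

Dividing by T = 0.20 multiplies the log-odds by 5 before the sigmoid, so the same evidence lands further from 0.5 than it would at T = 1.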
F1 Score: 99.68% (500-case synthetic)
Recall: 99.36% (3 false negatives)
Precision: 100% (0 false positives)
Calibration Error: 1.36% ECE (T = 0.20)
500-case benchmark: 99.68% F1, 99.36% recall, 100% precision, verified May 17, 2026. Methods: temperature scaling and context-gating penalties.
Benchmark suites
Choose your lens.
Engine v3 is validated across three internal suites plus an adversarial safety test. The 500-case synthetic benchmark is our primary public claim.
500-case Synthetic
Verified May 17, 2026
The primary public benchmark for Engine v3. 500 synthetic cases covering exact match, spelling, transliteration, phonetic, word order, entity aliases, PEP, multi-source, partial names, substring traps, generic business, adversarial, innocent, and common name categories.
Results at a glance
What held up. What needs work.
500 v3 cases across 13 categories. 11 categories passed cleanly. 2 categories show room for improvement.
Passed cleanly (11/13)
Official-list names matched at full confidence with calibrated probability via log-odds aggregation.
Lowercase, uppercase, comma formatting, and hyphenation variants all resolved correctly.
Reordered names and dropped tokens still matched via fuzzy token alignment and stable sort.
Single-character typos and commatization variants resolved via edit-distance pipeline.
Middle initials, patronymics, and name expansions matched via token-flexible scoring.
Known aliases, company aliases, and substitution variants resolved.
Cross-script matches (Arabic, Cyrillic, Latin) held up across transliteration boundaries.
Company names, acronyms, and beneficial ownership links resolved.
PEP vs sanctions distinction, non-sanctioned PEPs, and sanctioned PEPs all classified correctly.
Common names and source canaries correctly cleared via evidence-gate and surname-mismatch penalties. Zero false positives.
Non-sanctions lists (debarment, watchlist) and identifier matches resolved correctly.
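To make the reordered-name and dropped-token behavior above concrete, here is a minimal sketch of order-insensitive token alignment. It uses Python's standard `difflib.SequenceMatcher` as a stand-in similarity measure; the function name and the averaging rule are assumptions for illustration, not the engine's actual algorithm.

```python
from difflib import SequenceMatcher


def token_alignment_score(query: str, candidate: str) -> float:
    """Average, over query tokens, of the best similarity to any candidate token.

    Token order is ignored, and extra tokens in the candidate do not lower
    the score, so reordered names and dropped middle names still align.
    """
    q_tokens = query.casefold().replace(",", " ").split()
    c_tokens = candidate.casefold().replace(",", " ").split()
    if not q_tokens or not c_tokens:
        return 0.0
    best = [
        max(SequenceMatcher(None, q, c).ratio() for c in c_tokens)
        for q in q_tokens
    ]
    return sum(best) / len(best)


reordered = token_alignment_score("Putin, Vladimir", "Vladimir Putin")
dropped = token_alignment_score("Vladimir Putin", "Vladimir Vladimirovich Putin")
unrelated = token_alignment_score("John Smith", "Vladimir Putin")
```

A score like this would be only one input among several; on its own it cannot tell a dropped patronymic from a different person who shares two name tokens.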
Partial pass (2/13)
DOB, country, and entity-type mismatches suppressed via context-gating penalty (−1.2 log-odds). 2 false negatives on highly ambiguous boundary cases.
Zero-width space and synthetic evasion patterns detected and flagged. 1 false negative on an advanced evasion pattern.
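The zero-width evasion pattern mentioned above can be sketched with the standard `unicodedata` module. The character set, the function name, and the choice to return a flag alongside the cleaned string are assumptions for illustration; the real detection rules are not published here.

```python
import unicodedata

# Invisible characters sometimes inserted to break exact matching.
# This set is illustrative, not exhaustive.
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}


def normalize_name(raw: str) -> tuple[str, bool]:
    """Return (normalized name, flag) where flag marks invisible characters.

    NFKC folds compatibility forms such as full-width letters; the
    zero-width characters have no decomposition, so they are removed
    explicitly afterwards.
    """
    flagged = any(ch in ZERO_WIDTH for ch in raw)
    text = unicodedata.normalize("NFKC", raw)
    text = "".join(ch for ch in text if ch not in ZERO_WIDTH)
    return " ".join(text.casefold().split()), flagged


clean, suspicious = normalize_name("Vlad\u200bimir   PUTIN")
```

Some of these characters are legitimate in certain scripts (the zero-width non-joiner in Persian, for example), which is one reason to flag the input rather than silently discard the evidence.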
Example cases
How the pipeline handles real inputs.
Vladimir Putin
Exact name on OFAC SDN list. High confidence, primary sanctions source. v3 produces calibrated probability via log-odds aggregation.
بوتين
Arabic script transliteration matched to Latin-script OFAC entry via cross-script pipeline. Confidence reflects script-distance uncertainty.
V. Putin
Initial + surname matched via alias expansion and token reordering. Context-gating boosts confidence because DOB and country align.
Vladimri Putin
Single-character transposition typo matched via Damerau-Levenshtein distance. Context-gating confirms alignment with DOB and country evidence.
John Smith
Common name with no distinguishing fields. Context-gating penalty suppresses confidence. Correctly cleared. No review required.
Rosneft Oil Co
Entity alias matched via acronym expansion and token normalization. Penalty 14 (company evidence gate) confirms alignment.
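The "Vladimri Putin" case above relies on Damerau-Levenshtein distance, which counts an adjacent transposition as a single edit. The sketch below implements the restricted variant (optimal string alignment); which variant the engine uses is not stated, so treat this as an illustration of the technique only.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance (restricted Damerau-Levenshtein).

    Allowed edits: insertion, deletion, substitution, and transposition of
    two adjacent characters, each costing 1.
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            if (
                i > 1
                and j > 1
                and a[i - 1] == b[j - 2]
                and a[i - 2] == b[j - 1]
            ):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]


typo_distance = osa_distance("Vladimri Putin", "Vladimir Putin")  # 1
```

Plain Levenshtein distance would score the same pair as 2, because it has to model the swapped "ri" as two substitutions.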
False-positive cluster
Borderline cases we publish, not hide.
Engine v3 has zero false positives on the 500-case synthetic benchmark. The cases below are from the legacy 122-case v2 benchmark and are retained for transparency. All six are low-confidence PEP near-matches at exactly confidence 42 — the minimum match threshold. They are not sanctions block recommendations: the adjudication engine marks all six as AUTO_CLEAR with human_review_required: false. Suppressing them would risk false negatives on real phonetic PEP matches.
PEP phonetic near-match (Zandra Maulen Jofre). AUTO_CLEAR. From 122-case legacy benchmark. Retained for transparency.
PEP fuzzy near-match (Thoralf Heimdal). AUTO_CLEAR. From 122-case legacy benchmark. Retained for transparency.
PEP phonetic near-match (Myrtle Cole). AUTO_CLEAR. From 122-case legacy benchmark. Retained for transparency.
PEP phonetic near-match (Freitas Junior). AUTO_CLEAR. From 122-case legacy benchmark. Retained for transparency.
PEP phonetic near-match (Mike Testa). AUTO_CLEAR. From 122-case legacy benchmark. Retained for transparency.
PEP fuzzy near-match (Héctor Pérez Plazola). AUTO_CLEAR. From 122-case legacy benchmark. Retained for transparency.
Pipeline
What runs behind every screen.
Normalize & Generate
Unicode normalization, transliteration handling, lexical search, and semantic candidate generation across the configured active screening index.
Multi-algorithm Score
Exact, fuzzy, phonetic, and contextual scoring work together. Log-odds aggregation with Fellegi-Sunter contextual weighting replaces single-threshold heuristics.
FP Suppression
Penalty layers reduce business-name traps, entity/person mismatches, and noisy token collisions. Context-gating penalty: −1.2 log-odds when DOB, country, or identifier evidence is absent.
Structured Adjudicate
Ambiguous cases receive structured verdicts (BLOCK / ESCALATE / REVIEW / CLEAR) with documented rationale. Temperature-scaled confidence (T = 0.20) produces calibrated probabilities, not opaque scores.
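The FP suppression step can be illustrated with the two constants quoted above: the −1.2 log-odds context-gating penalty and T = 0.20. This sketch assumes the penalty applies only when DOB, country, and identifier evidence are all missing, and that it is added before temperature scaling; both are assumptions, as are the function names and the example name score of 0.6.

```python
import math

CONTEXT_PENALTY = -1.2  # log-odds, as quoted on this page
T = 0.20                # temperature, as quoted on this page


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def gated_probability(
    name_log_odds: float,
    has_dob: bool = False,
    has_country: bool = False,
    has_identifier: bool = False,
) -> float:
    """Apply the context-gating penalty when no supporting evidence exists."""
    z = name_log_odds
    if not (has_dob or has_country or has_identifier):
        z += CONTEXT_PENALTY  # assumed rule: penalize only if all are absent
    return sigmoid(z / T)


# Hypothetical name-only score of 0.6 log-odds.
without_context = gated_probability(0.6)
with_context = gated_probability(0.6, has_dob=True)
```

Under these assumptions the same name evidence falls below 0.5 without context and stays well above it with a matching date of birth, which is the behavior described for the "John Smith" example.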
Methodology
How to interpret the numbers.
Recall is the priority metric: 99.36% on 500 cases with only 3 false negatives. A missed sanctioned entity is a compliance failure. Precision is important, but never at the cost of recall.
Temperature scaling (T = 0.20) makes the sigmoid ~5× steeper, fixing underconfidence on edge cases. ECE of 1.36% means predicted confidence aligns with observed accuracy.
Context-gating penalty prevents high-confidence matches on ambiguous common names without supporting evidence (DOB, country, identifier). This is what drives the zero false positive rate on 500 cases.
Phase 1 FP reduction (May 2026): surname-mismatch penalty, common-name evidence gate, and stable sort eliminated all common-name false positives.
Phase 2 FP reduction (May 2026): entity-name FPs eliminated. Generic business names (Global Trading LLC, Eastern Star Consulting) now correctly clear via Strategy 3 gate and Penalty 14.
The 6 legacy borderline cases (v2, 122-case) are retained for transparency: all are low-confidence PEP near-matches at confidence 42 that the API marks as AUTO_CLEAR with human_review_required: false. Suppressing them would risk recall loss on real phonetic PEP variants.
Latency remains acceptable for production: p50 45ms, p95 202ms, p99 276ms on the clean v3 rerun.
The value of publishing this page is transparency, not claiming a perfect score. These numbers will change as usage data informs further tuning.
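For readers unfamiliar with the calibration metric cited in these notes, the sketch below shows one common way to compute expected calibration error (ECE) for a binary match score: bin the predicted probabilities, then take the size-weighted gap between mean predicted probability and observed match rate in each bin. The bin count and function name are assumptions; the page does not state which binning scheme produced the 1.36% figure.

```python
def expected_calibration_error(probs, labels, n_bins: int = 10) -> float:
    """ECE with equal-width bins over predicted match probabilities.

    probs:  predicted probabilities in [0, 1]
    labels: 1 for a true match, 0 otherwise
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))

    total = len(probs)
    if total == 0:
        return 0.0

    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_confidence = sum(p for p, _ in bucket) / len(bucket)
        observed_rate = sum(y for _, y in bucket) / len(bucket)
        ece += (len(bucket) / total) * abs(mean_confidence - observed_rate)
    return ece


# Toy data: two predictions of 0.9, only one of which is a true match.
toy_ece = expected_calibration_error([0.9, 0.9], [1, 0])
```

A lower value means predicted probabilities track observed outcomes more closely; it says nothing by itself about recall or precision.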
Adversarial safety
Conservative by design.
The adversarial suite tests 10 corruption categories on 247 cases. The 37.79% recall on severely corrupted inputs is intentional: the primary goal is a 0% danger score, meaning no false positives on corrupted common names.
Per-category recall on corrupted inputs
Low recall on severe corruption (contamination, org suffix, token omission) is expected and acceptable. These cases represent inputs so degraded that matching them would require accepting high false-positive risk on innocent similar names. The engine prioritizes safety over recall on unrecoverable corruption.
Context
How this compares to published research.
These numbers are useful for context, but the datasets are not directly comparable. The point of this page is transparency, not pretending one benchmark can replace every other one.
| System | F1 | Recall | Precision | Notes |
|---|---|---|---|---|
| Verifex Engine v3 (500-case) | 99.68% | 99.36% | 100% | 500 synthetic cases. Probabilistic scoring. Self-administered. |
| Verifex v2 (122-case legacy) | 96.6% | 100% | 93.4% | 122 cases. Rule-based pipeline. Retained for comparison. |
| GPT-4o (Published Pairs Dataset) | 98.95% | — | — | 755K labeled pairs, different dataset |
| DeepSeek-R1 14B (Published Pairs Dataset) | 98.23% | — | — | 755K labeled pairs, different dataset |
| Rule-based baseline (Published Pairs Dataset) | 91.33% | — | — | Published fuzzy baseline |
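The F1 values in the two Verifex rows can be checked from the listed precision and recall, since F1 is their harmonic mean. The short sketch below does that arithmetic using only numbers from the table; the function and variable names are illustrative.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as listed in the table above.
v3_f1 = f1_score(precision=1.00, recall=0.9936)   # rounds to 99.68%
v2_f1 = f1_score(precision=0.934, recall=1.00)    # rounds to 96.6%
```

The published-pairs rows report F1 only, so the same check cannot be run for them from this page.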
Transparency
Why benchmark transparency matters.
Auditable claims
Every accuracy claim on this page is tied to a reproducible test case. No marketing metrics without source data.
Failure disclosure
We publish the exact queries that produce false positives, not just the headline score. Compliance teams need to know edge cases.
Continuous improvement
The benchmark is rerun after every major pipeline change. Scores that go down are published alongside scores that go up.