Internal benchmark

Trust should be measured.

Verifex publishes its benchmark methodology for sanctions matching, covering calibrated confidence scoring, transliteration, aliases, false positives, and edge cases. Engine v3 benchmark results are published below; production screening runs on the validated v3 pipeline.

Engine v3

Probabilistic scoring. Calibrated confidence.

Log-odds aggregation with Fellegi-Sunter contextual weighting. Temperature-scaled sigmoid (T = 0.20). Context-gating penalties. Every score is a probability, not a heuristic.

99.68%

F1 Score

500-case synthetic

99.36%

Recall

3 false negatives

100%

Precision

0 false positives

1.36%

Calibration Error

T = 0.2

500-case benchmark: 99.68% F1, 99.36% recall, 100% precision, verified May 17, 2026 · Temperature scaling · Context-gating penalties

Benchmark suites

Choose your lens.

Engine v3 is validated across three internal suites plus an adversarial safety test. The 500-case synthetic benchmark is our primary public claim.

500-case Synthetic

Verified May 17, 2026

99.68%

F1 Score

99.36%

Recall

100%

Precision

500

Test cases

463

True positives

34

True negatives

0

False positives

3

False negatives

1.36%

ECE

0.81%

Brier

The primary internal benchmark for Engine v3. 500 synthetic cases covering exact match, spelling, transliteration, phonetic, word order, entity aliases, PEP, multi-source, partial names, substring traps, generic business, adversarial, innocent, and common name categories.

Self-administered benchmark disclosure: This benchmark was designed and executed by Verifex on a test suite we authored. It has not been independently audited by a third party. Results reflect performance on our published test set, not a guarantee of identical accuracy across all production data. Read the full matching methodology.

Results at a glance

What held up. What needs work.

500 v3 cases across 13 categories. 11 categories passed cleanly; 2 show room for improvement.

Passed cleanly (11/13)

Exact match · 29/29

Official-list names matched at full confidence with calibrated probability via log-odds aggregation.

Case & formatting · 118/118

Lowercase, uppercase, comma formatting, and hyphenation variants all resolved correctly.

Token reorder & drop · 56/56

Reordered names and dropped tokens still matched via fuzzy token alignment and stable sort.

Typos & spelling · 78/78

Single-character typos and commatization variants resolved via edit-distance pipeline.

Name enrichment · 79/79

Middle initials, patronymics, and name expansions matched via token-flexible scoring.

Alias variants · 19/19

Known aliases, company aliases, and substitution variants resolved.

Transliteration · 20/20

Cross-script matches (Arabic, Cyrillic, Latin) held up across transliteration boundaries.

Entity & UBO · 15/15

Company names, acronyms, and beneficial ownership links resolved.

PEP distinction · 10/10

PEP vs sanctions distinction, non-sanctioned PEPs, and sanctioned PEPs all classified correctly.

False-positive safety · 32/32

Common names and source canaries correctly cleared via evidence-gate and surname-mismatch penalties. Zero false positives.

Debarment & watchlist · 9/9

Non-sanctions lists (debarment, watchlist) and identifier matches resolved correctly.

Partial pass (2/13)

Context gating · 25/27 (93%)

DOB, country, and entity-type mismatches suppressed via context-gating penalty (−1.2 log-odds). 2 false negatives on highly ambiguous boundary cases.

Evasion detection · 7/8 (88%)

Zero-width space and synthetic evasion patterns detected and flagged. 1 false negative on an advanced evasion pattern.

Example cases

How the pipeline handles real inputs.

Exact match · confirmed_match

Rosneft

Exact name on OFAC SDN list. High confidence, primary sanctions source. v3 produces calibrated probability via log-odds aggregation.

confidence: 0.99

Transliteration · possible_match

بوتين

Arabic script transliteration matched to Latin-script OFAC entry via cross-script pipeline. Confidence reflects script-distance uncertainty.

confidence: 0.87

Alias variant · possible_match

R. Neft

Initial + surname matched via alias expansion and token reordering. Context-gating boosts confidence because DOB and country align.

confidence: 0.82

Typo tolerance · possible_match

Rosnefft

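Single-character insertion typo (a doubled "f") matched via Damerau-Levenshtein distance, which counts insertions, deletions, substitutions, and adjacent transpositions as one edit each. Context-gating confirms alignment with DOB and country evidence.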

confidence: 0.91
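To make the typo case concrete, here is a minimal sketch of the optimal string alignment variant of Damerau-Levenshtein distance. It is an illustration only, not Verifex's edit-distance pipeline; the function name `osa_distance` and the lowercase inputs are assumptions.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance: insert, delete, substitute,
    or swap two adjacent characters, each at cost 1."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete every character of a
    for j in range(cols):
        d[0][j] = j  # insert every character of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Adjacent transposition, e.g. "en" <-> "ne"
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(osa_distance("rosnefft", "rosneft"))  # doubled "f": one insertion
print(osa_distance("rosenft", "rosneft"))   # swapped "en": one transposition
```

Both inputs sit one edit away from the listed name, which is why a distance-based matcher treats them as strong candidates before any context evidence is weighed.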

False positive / auto-clear · clear

John Smith

Common name with no distinguishing fields. Context-gating penalty suppresses confidence. Correctly cleared. No review required.

confidence: 0.12

Entity alias · confirmed_match

Rosneft Oil Co

Entity alias matched via acronym expansion and token normalization. Penalty 14 (company evidence gate) confirms alignment.

confidence: 0.95

False-positive cluster

Borderline cases we publish, not hide.

Engine v3 has zero false positives on the 500-case synthetic benchmark. The cases below are from the legacy 122-case v2 benchmark and are retained for transparency. All six are low-confidence PEP near-matches at exactly confidence 42 — the minimum match threshold. They are not sanctions block recommendations: the adjudication engine marks all six as AUTO_CLEAR with human_review_required: false. Suppressing them would risk false negatives on real phonetic PEP matches.

Zandralina Poffwick · Legacy v2 · 42% confidence

PEP phonetic near-match (Zandra Maulen Jofre). AUTO_CLEAR. From 122-case legacy benchmark. Retained for transparency.

Thorblast Grimmjaw · Legacy v2 · 42% confidence

PEP fuzzy near-match (Thoralf Heimdal). AUTO_CLEAR. From 122-case legacy benchmark. Retained for transparency.

Myrtalee Quennford · Legacy v2 · 42% confidence

PEP phonetic near-match (Myrtle Cole). AUTO_CLEAR. From 122-case legacy benchmark. Retained for transparency.

Jurbinka Fretzelmann · Legacy v2 · 42% confidence

PEP phonetic near-match (Freitas Junior). AUTO_CLEAR. From 122-case legacy benchmark. Retained for transparency.

Test User · Legacy v2 · 42% confidence

PEP phonetic near-match (Mike Testa). AUTO_CLEAR. From 122-case legacy benchmark. Retained for transparency.

Placeholder Name · Legacy v2 · 42% confidence

PEP fuzzy near-match (Héctor Pérez Plazola). AUTO_CLEAR. From 122-case legacy benchmark. Retained for transparency.

Pipeline

What runs behind every screen.

Normalize & Generate

Unicode normalization, transliteration handling, lexical search, and semantic candidate generation across the configured active screening index.
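As a rough illustration of the normalization step, the sketch below folds compatibility forms, strips zero-width characters and accents, and collapses whitespace using only the Python standard library. The function name `normalize_name` and the exact set of invisible characters are assumptions, not the production normalizer.

```python
import unicodedata

# Illustrative set of invisible characters that can be used to split a name.
INVISIBLE = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}


def normalize_name(raw: str) -> str:
    """Hypothetical normalizer for candidate generation."""
    # NFKC folds compatibility forms such as full-width Latin letters.
    text = unicodedata.normalize("NFKC", raw)
    # Drop zero-width characters inserted to evade matching.
    text = "".join(ch for ch in text if ch not in INVISIBLE)
    # Decompose, then remove combining marks (accents).
    decomposed = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Case-insensitive comparison form with single spaces.
    return " ".join(text.casefold().split())


print(normalize_name("Ros\u200bneft  OIL"))
print(normalize_name("Héctor Pérez"))
```

Stripping combining marks also removes vowel diacritics in scripts such as Arabic, so a real pipeline would likely handle those scripts with script-specific rules rather than this blanket pass.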

Multi-algorithm Score

Exact, fuzzy, phonetic, and contextual scoring work together. Log-odds aggregation with Fellegi-Sunter contextual weighting replaces single-threshold heuristics.
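For readers unfamiliar with Fellegi-Sunter weighting, the following sketch shows the general idea: each compared field adds a log-likelihood-ratio weight, and the weights are summed into a single log-odds value. The field names and the m/u probabilities are made-up illustrative values, not Verifex's parameters.

```python
import math
from typing import Mapping

# Hypothetical (m, u) pairs: m = P(field agrees | true match),
# u = P(field agrees | non-match). Values are illustrative only.
FIELDS = {
    "name": (0.95, 0.01),
    "dob": (0.90, 0.003),
    "country": (0.85, 0.10),
}


def field_weight(m: float, u: float, agrees: bool) -> float:
    """Fellegi-Sunter weight: log(m/u) on agreement, log((1-m)/(1-u)) otherwise."""
    return math.log(m / u) if agrees else math.log((1 - m) / (1 - u))


def aggregate_log_odds(evidence: Mapping[str, bool], prior_log_odds: float = 0.0) -> float:
    """Sum the weights of the compared fields; fields missing from
    `evidence` contribute nothing in this sketch."""
    total = prior_log_odds
    for field, agrees in evidence.items():
        m, u = FIELDS[field]
        total += field_weight(m, u, agrees)
    return total


name_only = aggregate_log_odds({"name": True})
with_dob = aggregate_log_odds({"name": True, "dob": True})
dob_conflict = aggregate_log_odds({"name": True, "dob": False})
print(name_only, with_dob, dob_conflict)
```

Because the weights add, agreeing context raises the log-odds and conflicting context lowers it, which is the behavior a single similarity threshold cannot express.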

FP Suppression

Penalty layers reduce business-name traps, entity/person mismatches, and noisy token collisions. Context-gating penalty: −1.2 log-odds when DOB, country, or identifier evidence is absent.
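One possible reading of the context-gating rule, sketched below, applies the −1.2 log-odds penalty once when none of the supporting fields is available. Whether the penalty is applied once or per missing field is not stated on this page, so the gating condition and the function name `apply_context_gate` are assumptions.

```python
CONTEXT_PENALTY = -1.2  # log-odds penalty quoted on this page


def apply_context_gate(
    log_odds: float,
    has_dob: bool,
    has_country: bool,
    has_identifier: bool,
) -> float:
    """Assumed rule: penalize once when no supporting evidence is present."""
    if has_dob or has_country or has_identifier:
        return log_odds
    return log_odds + CONTEXT_PENALTY


# A name-only hit loses 1.2 log-odds; the same hit with a DOB keeps its score.
print(apply_context_gate(2.0, False, False, False))
print(apply_context_gate(2.0, True, False, False))
```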

Structured Adjudicate

Ambiguous cases receive structured verdicts (BLOCK / ESCALATE / REVIEW / CLEAR) with documented rationale. Temperature-scaled confidence (T = 0.20) produces calibrated probabilities, not opaque scores.
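The temperature-scaled sigmoid can be sketched as follows, assuming the standard form p = 1 / (1 + exp(−z / T)) with z the aggregated log-odds. The helper names are hypothetical; only T = 0.20 comes from this page.

```python
import math


def scaled_sigmoid(log_odds: float, temperature: float = 0.20) -> float:
    """Map log-odds to a probability with temperature T (numerically stable)."""
    x = log_odds / temperature
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def slope_at_zero(temperature: float, h: float = 1e-4) -> float:
    """Central-difference estimate of the sigmoid's slope at log-odds 0."""
    return (scaled_sigmoid(h, temperature) - scaled_sigmoid(-h, temperature)) / (2 * h)


# The analytic slope at zero is 1 / (4T), so T = 0.20 is 5x steeper than T = 1.
print(slope_at_zero(0.20) / slope_at_zero(1.0))
```

A steeper curve pushes mid-range log-odds toward 0 or 1, so the same evidence yields a more decisive probability than it would at T = 1.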

Methodology

How to interpret the numbers.

Recall is the priority metric: 99.36% on 500 cases with only 3 false negatives. A missed sanctioned entity is a compliance failure. Precision is important, but never at the cost of recall.
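The headline figures follow directly from the confusion counts reported on this page (463 true positives, 3 false negatives, 0 false positives). A short check, with a hypothetical helper name:

```python
def classification_metrics(tp: int, fp: int, fn: int) -> dict:
    """Precision, recall, and F1 from confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


metrics = classification_metrics(tp=463, fp=0, fn=3)
for name, value in metrics.items():
    print(f"{name}: {value:.2%}")
# precision: 100.00%
# recall: 99.36%
# f1: 99.68%
```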

Temperature scaling (T = 0.20) makes the sigmoid ~5× steeper, fixing underconfidence on edge cases. ECE of 1.36% means predicted confidence aligns with observed accuracy.
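This page does not say how ECE and the Brier score were computed. The sketch below uses one common definition, assuming equal-width bins (10 by default) and binary match labels; the sample data is invented for illustration.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float], labels: Sequence[int], n_bins: int = 10
) -> float:
    """Weighted mean gap between average confidence and observed match rate per bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(confidences, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(p for p, _ in bucket) / len(bucket)
        match_rate = sum(y for _, y in bucket) / len(bucket)
        ece += (len(bucket) / total) * abs(avg_conf - match_rate)
    return ece


def brier_score(confidences: Sequence[float], labels: Sequence[int]) -> float:
    """Mean squared error between predicted probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(confidences, labels)) / len(confidences)


# Invented example: five predictions at 0.8, four of which are true matches.
confs = [0.8, 0.8, 0.8, 0.8, 0.8]
truth = [1, 1, 1, 1, 0]
print(expected_calibration_error(confs, truth), brier_score(confs, truth))
```

In the invented example the predictions are well calibrated (80% confidence, 80% observed), so ECE is near zero even though the Brier score is not, which shows why the two metrics are reported side by side.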

Context-gating penalty prevents high-confidence matches on ambiguous common names without supporting evidence (DOB, country, identifier). This is what drives the zero false positive rate on 500 cases.

Phase 1 FP reduction (May 2026): surname-mismatch penalty, common-name evidence gate, and stable sort eliminated all common-name false positives.

Phase 2 FP reduction (May 2026): entity-name FPs eliminated. Generic business names (Global Trading LLC, Eastern Star Consulting) now correctly clear via Strategy 3 gate and Penalty 14.

The 6 legacy borderline cases (v2, 122-case) are retained for transparency: all are low-confidence PEP near-matches at confidence 42 that the API marks as AUTO_CLEAR with human_review_required: false. Suppressing them would risk recall loss on real phonetic PEP variants.

Latency: sub-second response times in production use. We're holding off on publishing specific percentile figures until we've run and reconciled a fresh, independently-verified load test — an earlier informal measurement was inconsistent with prior published numbers, and we'd rather show nothing than show a number we can't stand behind.

The value of publishing this page is transparency, not claiming a perfect score. These numbers will change as usage data informs further tuning.

Adversarial safety

Conservative by design.

The adversarial suite tests 10 corruption categories on 247 cases. The 37.79% recall on severely corrupted inputs is intentional: the primary goal is a 0% danger score, meaning no false positives on corrupted common names.

37.79%

Recall

100%

Precision

0%

Danger Score

247

Cases

Per-category recall on corrupted inputs

Token Reorder · 100.00% · 12 cases

Mixed Script · 80.00% · 40 cases

Transliteration · 80.00% · 5 cases

Nickname · 66.67% · 6 cases

Keyboard · 35.00% · 50 cases

OCR · 33.33% · 70 cases

Spacing · 12.50% · 40 cases

Contamination · 5.00% · 20 cases

Org Suffix · 0.00% · 2 cases

Token Omit · 0.00% · 2 cases

Low recall on severe corruption (contamination, org suffix, token omission) is expected and acceptable. These cases represent inputs so degraded that matching them would require accepting high false-positive risk on innocent similar names. The engine prioritizes safety over recall on unrecoverable corruption.

Context

How this compares to published research.

These numbers are useful for context, but the datasets are not directly comparable. The point of this page is transparency, not pretending one benchmark can replace every other one.

System	F1	Recall	Precision	Notes
Verifex Engine v3 (500-case)	99.68%	99.36%	100%	500 synthetic cases. Probabilistic scoring. Self-administered.
Verifex v2 (122-case legacy)	96.6%	100%	93.4%	122 cases. Rule-based pipeline. Retained for comparison.
GPT-4o (Published Pairs Dataset)	98.95%	—	—	755K labeled pairs, different dataset
DeepSeek-R1 14B (Published Pairs Dataset)	98.23%	—	—	755K labeled pairs, different dataset
Rule-based baseline (Published Pairs Dataset)	91.33%	—	—	Published fuzzy baseline

Transparency

Why benchmark transparency matters.

Auditable claims

Every accuracy claim on this page is tied to a reproducible test case. No marketing metrics without source data.

Failure disclosure

We publish the exact queries that produce false positives, not just the headline score. Compliance teams need to know edge cases.

Continuous improvement

The benchmark is rerun after every major pipeline change. Scores that go down are published alongside scores that go up.
