Published benchmark

Trust should be measured.

Verifex publishes benchmark methodology for sanctions matching, transliteration, aliases, false positives, and edge cases. Every result is inspectable.

F1 Score: 96.6% (on our published benchmark)
Recall: 100% (0 false negatives)
Precision: 93.4% (6 false positives)
p50 Latency: 45ms (p99 276ms)

Test cases: 122
Categories: 15
True positives: 85
True negatives: 31
False positives: 6
False negatives: 0

Verified May 2, 2026 · 122-case published benchmark: 96.6% F1, 100% recall, 93.4% precision.
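
The headline figures follow directly from the four confusion-matrix counts above. A minimal sketch of that arithmetic is shown here so the numbers can be rechecked by hand; the function name `summarize` is illustrative and not part of any Verifex API.

```python
def summarize(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Derive precision, recall, and F1 from raw confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "total": tp + tn + fp + fn,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Counts as published on this page.
result = summarize(tp=85, tn=31, fp=6, fn=0)
print(
    f"{result['total']} cases: F1 {result['f1']:.1%}, "
    f"recall {result['recall']:.1%}, precision {result['precision']:.1%}"
)
# prints: 122 cases: F1 96.6%, recall 100.0%, precision 93.4%
```

Precision is 85 / (85 + 6) and recall is 85 / (85 + 0), so the six false positives are the only thing separating the F1 score from 100%.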

Reflects the v2 production engine. v3 runs in shadow-only mode and is not yet used for live decisions.

Self-administered benchmark disclosure: This benchmark was designed and executed by Verifex on a test suite we authored. It has not been independently audited by a third party. Results reflect performance on our published test set, not a guarantee of identical accuracy across all production data. Read the full matching methodology.

Results at a glance

What held up. What needs work.

122 cases across 15 categories. 13 categories passed cleanly. 2 categories show room for improvement.

Passed cleanly (13/15)

Exact: 15/15

Official-list names matched cleanly.

Spelling: 10/10

Typos and minor spelling variants held up.

Arabic: 11/11

Cross-script transliteration remained strong.

Cyrillic: 5/5

German/French transliterations all matched.

Phonetic: 4/4

Phonetic variants remained intact.

Word Order: 8/8

Reordered names still resolved.

Entity: 12/12

Entity aliases and acronyms held.

PEP: 12/12

PEP coverage passed cleanly.

Multi Source: 5/5

Cross-jurisdiction matches stayed intact.

Partial: 3/3

Distinctive surname-only lookups passed.

Substring Trap: 9/9

False substring matches were suppressed.

Generic Business: 8/8

Invented company names cleared correctly.

Common Name: 5/5

Phase 1 FP reduction: all common-name cases now clear correctly with surname-mismatch and evidence-gate penalties.

Partial pass (2/15)

Adversarial: 3/5 (60%)

Two adversarial inputs produce low-confidence PEP near-matches (confidence 42, AUTO_CLEAR). These are not sanctions matches.

Innocent: 6/10 (60%)

Four invented-name cases produce low-confidence PEP near-matches (confidence 42, AUTO_CLEAR). These are intentionally tolerated; see the false-positive cluster section.
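
The per-category results above can be tallied to confirm they add up to the headline totals. The sketch below copies the pass counts from this page into a plain dictionary; the variable names are illustrative only.

```python
# (passed, total) per category, copied from the results on this page.
CATEGORY_RESULTS = {
    "Exact": (15, 15),
    "Spelling": (10, 10),
    "Arabic": (11, 11),
    "Cyrillic": (5, 5),
    "Phonetic": (4, 4),
    "Word Order": (8, 8),
    "Entity": (12, 12),
    "PEP": (12, 12),
    "Multi Source": (5, 5),
    "Partial": (3, 3),
    "Substring Trap": (9, 9),
    "Generic Business": (8, 8),
    "Common Name": (5, 5),
    "Adversarial": (3, 5),
    "Innocent": (6, 10),
}

# Categories where every case passed.
clean = [name for name, (p, t) in CATEGORY_RESULTS.items() if p == t]
# Categories with at least one miss, mapped to the number of misses.
misses = {name: t - p for name, (p, t) in CATEGORY_RESULTS.items() if p < t}

total_cases = sum(t for _, t in CATEGORY_RESULTS.values())
total_passed = sum(p for p, _ in CATEGORY_RESULTS.values())

print(f"{len(clean)}/{len(CATEGORY_RESULTS)} categories clean")  # 13/15 categories clean
print(f"{total_passed}/{total_cases} cases passed")              # 116/122 cases passed
print(misses)  # {'Adversarial': 2, 'Innocent': 4}
```

The six misses (2 adversarial, 4 innocent) are the same six false positives reported in the headline counts.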

Example cases

How the pipeline handles real inputs.

Exact match: confirmed_match

Vladimir Putin

Exact name on OFAC SDN list. High confidence, primary sanctions source.

Transliteration: possible_match

بوتين

Arabic script transliteration matched to Latin-script OFAC entry via cross-script pipeline.

Alias variant: possible_match

V. Putin

Initial + surname matched via alias expansion and token reordering.

False positive / auto-clear: clear

John Smith

Common name with no distinguishing fields. Correctly cleared with low confidence. No review required.

Low-confidence PEP near-match: clear

Myrtalee Quennford

Phonetic near-match to PEP entry. AUTO_CLEAR with human_review_required: false. Intentionally tolerated to protect recall.

False-positive cluster

Borderline cases we publish, not hide.

Six borderline cases remain after Phase 2 FP reduction (May 2026). All six are low-confidence PEP near-matches at exactly confidence 42, the minimum match threshold. They are not sanctions block recommendations: the adjudication engine marks all six as AUTO_CLEAR with human_review_required: false. Suppressing them would risk false negatives on real phonetic PEP matches.

Zandralina Poffwick · Innocent · 42% confidence

PEP phonetic near-match (Zandra Maulen Jofre). AUTO_CLEAR.

Thorblast Grimmjaw · Innocent · 42% confidence

PEP fuzzy near-match (Thoralf Heimdal). AUTO_CLEAR.

Myrtalee Quennford · Innocent · 42% confidence

PEP phonetic near-match (Myrtle Cole). AUTO_CLEAR.

Jurbinka Fretzelmann · Innocent · 42% confidence

PEP phonetic near-match (Freitas Junior). AUTO_CLEAR.

Test User · Adversarial · 42% confidence

PEP phonetic near-match (Mike Testa). AUTO_CLEAR.

Placeholder Name · Adversarial · 42% confidence

PEP fuzzy near-match (Héctor Pérez Plazola). AUTO_CLEAR.
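
For an integrator, the practical question is whether these six results ever reach an analyst queue. The sketch below shows one way a client might triage them. Only `AUTO_CLEAR` and `human_review_required` come from this page; the dictionary shape, the `query`, `confidence`, and `adjudication` keys, and the `needs_analyst` helper are assumptions made for illustration and may not match the real response format.

```python
from typing import TypedDict


class ScreenResult(TypedDict):
    # Hypothetical response shape; key names are assumptions.
    query: str
    confidence: int
    adjudication: str
    human_review_required: bool


def needs_analyst(result: ScreenResult) -> bool:
    """Route to a human only when the engine asks for review or did not auto-clear."""
    return result["human_review_required"] or result["adjudication"] != "AUTO_CLEAR"


# The six published borderline cases, encoded in the assumed shape.
CLUSTER_QUERIES = [
    "Zandralina Poffwick",
    "Thorblast Grimmjaw",
    "Myrtalee Quennford",
    "Jurbinka Fretzelmann",
    "Test User",
    "Placeholder Name",
]
cluster: list[ScreenResult] = [
    {
        "query": name,
        "confidence": 42,
        "adjudication": "AUTO_CLEAR",
        "human_review_required": False,
    }
    for name in CLUSTER_QUERIES
]

review_queue = [r["query"] for r in cluster if needs_analyst(r)]
print(len(review_queue))  # 0
```

Under these assumptions, all six cases count against precision in the benchmark but create no analyst workload.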

Pipeline

What runs behind every screen.

01 · Normalize & Generate

Unicode normalization, transliteration handling, lexical search, and semantic candidate generation across the configured active screening index.

02 · Multi-algorithm Score

Exact, fuzzy, phonetic, and contextual scoring work together instead of relying on a single similarity metric.

03 · FP Suppression

Penalty layers reduce business-name traps, entity/person mismatches, and noisy token collisions.

04 · Structured Adjudicate

Ambiguous cases receive structured verdicts (BLOCK / REVIEW / CLEAR) with documented rationale, not opaque scores.
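
To make the first two stages concrete, here is a small standard-library sketch of Unicode normalization followed by a score that combines two similarity views. It is a stand-in, not the Verifex engine: it uses `difflib.SequenceMatcher` as a generic fuzzy metric, omits transliteration, phonetic, and semantic matching entirely, and all function names are hypothetical.

```python
import unicodedata
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Strip diacritics, fold case, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(stripped.casefold().split())


def token_sorted(name: str) -> str:
    """Sort tokens so that reordered names compare as equal."""
    return " ".join(sorted(name.split()))


def score(query: str, candidate: str) -> float:
    """Take the best of a direct comparison and an order-insensitive one."""
    q, c = normalize(query), normalize(candidate)
    direct = SequenceMatcher(None, q, c).ratio()
    reordered = SequenceMatcher(None, token_sorted(q), token_sorted(c)).ratio()
    return max(direct, reordered)


print(normalize("Héctor  Pérez Plazola"))         # hector perez plazola
print(score("Putin Vladimir", "Vladimir Putin"))  # 1.0
```

Taking the maximum over several views is one simple way to avoid depending on a single similarity metric; a production system would add penalty layers on top, as stage 03 describes.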

Methodology

How to interpret the numbers.

1. Recall is the priority metric: zero false negatives on the verified run. A missed sanctioned entity is a compliance failure. Precision is important, but never at the cost of recall.

2. Phase 1 FP reduction (May 2026): surname-mismatch penalty, common-name evidence gate, and stable sort eliminated all common-name false positives.

3. Phase 2 FP reduction (May 2026): entity-name FPs eliminated. Generic business names (Global Trading LLC, Eastern Star Consulting) now correctly clear via Strategy 3 gate and Penalty 14.

4. Remaining 6 FPs are intentionally tolerated borderline cases: all are low-confidence PEP near-matches at confidence 42 that the API already marks as AUTO_CLEAR with human_review_required: false. Suppressing them risks recall loss on real phonetic PEP variants.

5. Latency is acceptable for production: p50 45ms, p95 202ms, p99 276ms on the clean rerun.

6. The value of publishing this page is transparency, not claiming a perfect score. These numbers will change as usage data informs further tuning.
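
The latency figures in the notes above are percentiles over per-request timings. As a reference for reading them, the sketch below computes p50, p95, and p99 with the nearest-rank method. The page does not state which percentile definition was used, so nearest-rank is an assumption, and the timings are synthetic placeholders rather than benchmark data.

```python
import math


def percentile(samples: list[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Synthetic timings (1 ms .. 100 ms), for illustration only.
timings_ms = [float(i) for i in range(1, 101)]

p50 = percentile(timings_ms, 50)
p95 = percentile(timings_ms, 95)
p99 = percentile(timings_ms, 99)
print(p50, p95, p99)  # 50.0 95.0 99.0
```

A p99 of 276ms means that roughly 1 request in 100 took longer than that on the measured run, which is why the tail figures matter alongside the median.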

Context

How this compares to published research.

These numbers are useful for context, but the datasets are not directly comparable. The point of this page is transparency, not pretending one benchmark can replace every other one.

System | F1 | Recall | Precision | Notes
Verifex (published benchmark) | 96.6% | 100% | 93.4% | 122 sanctions-specific cases. Self-administered.
GPT-4o (Published Pairs Dataset) | 98.95% | - | - | 755K labeled pairs, different dataset
DeepSeek-R1 14B (Published Pairs Dataset) | 98.23% | - | - | 755K labeled pairs, different dataset
Rule-based baseline (Published Pairs Dataset) | 91.33% | - | - | Published fuzzy baseline

Transparency

Why benchmark transparency matters.

Auditable claims

Every accuracy claim on this page is tied to a reproducible test case. No marketing metrics without source data.

Failure disclosure

We publish the exact queries that produce false positives, not just the headline score. Compliance teams need to know where the edge cases are.

Continuous improvement

The benchmark is rerun after every major pipeline change. Scores that go down are published alongside scores that go up.

Get started

Run your first screen.

Start with 50 free screens. No credit card. See the benchmark in action on your own data.