Model card

Matching methodology.

How Verifex scores sanctions matches: the four stages, confidence calibration, threshold rationale, and known limitations.

Pipeline

Four-stage matching

Stage 1: Exact matching

Token-reordered exact match with Unicode normalization (NFKC). Catches identical names regardless of word order. Fast path for obvious hits and clear non-hits.
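A minimal sketch of this stage using only the Python standard library; the normalization and token handling are illustrative assumptions, not Verifex's actual implementation:

```python
# Illustrative Stage-1 sketch: NFKC-normalize, casefold, and compare
# sorted tokens so word order is ignored. Not Verifex's actual code.
import unicodedata

def normalize_tokens(name: str) -> tuple[str, ...]:
    """NFKC-normalize and casefold, then sort tokens to neutralize word order."""
    normalized = unicodedata.normalize("NFKC", name).casefold()
    return tuple(sorted(normalized.split()))

def exact_match(query: str, candidate: str) -> bool:
    """True when both names contain exactly the same normalized tokens."""
    return normalize_tokens(query) == normalize_tokens(candidate)

print(exact_match("ACME Trading LLC", "llc acme trading"))  # True
```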

Stage 2: Fuzzy matching

Jaro-Winkler, Monge-Elkan, and Soft TF-IDF cosine similarity for edit-distance and token-based matching. Handles typos, spelling variants, and name reordering. IDF weighting reduces common-name false positives.
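For intuition, here is a hedged sketch of Monge-Elkan with Jaro-Winkler as the inner metric, assuming the third-party `jellyfish` package; the Soft TF-IDF component is omitted for brevity, and none of this is Verifex's implementation:

```python
# Monge-Elkan over Jaro-Winkler: average, for each query token, of its
# best similarity against any candidate token. Assumes the third-party
# `jellyfish` package; an illustrative sketch only.
import jellyfish

def monge_elkan(query: str, candidate: str) -> float:
    q_tokens = query.casefold().split()
    c_tokens = candidate.casefold().split()
    if not q_tokens or not c_tokens:
        return 0.0
    best_per_token = (
        max(jellyfish.jaro_winkler_similarity(q, c) for c in c_tokens)
        for q in q_tokens
    )
    return sum(best_per_token) / len(q_tokens)

print(round(monge_elkan("Jon Smith", "Smith Jonathon"), 3))  # high despite typo and reorder
```

Note that Monge-Elkan is asymmetric in query and candidate, which is one reason production scorers combine several metrics rather than relying on any single one.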

Stage 3: Phonetic matching

Double Metaphone encodes names by pronunciation rather than spelling. Catches transliteration variants (e.g., Mohammed / Muhammad / Mohamed) that edit-distance metrics miss.
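A small demonstration, assuming the third-party `metaphone` package; treating names as phonetic candidates when any of their encodings overlap is an illustrative policy, not necessarily Verifex's:

```python
# Phonetic bucketing sketch using Double Metaphone via the third-party
# `metaphone` package. doublemetaphone() returns (primary, secondary) codes.
from metaphone import doublemetaphone

def phonetic_candidates(a: str, b: str) -> bool:
    """True when any non-empty primary/secondary encoding overlaps."""
    codes_a = {code for code in doublemetaphone(a) if code}
    codes_b = {code for code in doublemetaphone(b) if code}
    return bool(codes_a & codes_b)

for variant in ("Mohammed", "Muhammad", "Mohamed"):
    print(variant, doublemetaphone(variant))  # the variants typically collapse to one code
```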

Stage 4: Structured adjudication

For ambiguous matches (typically 5–15% of screenings), an optional assisted-review stage produces documented rationale based on match context, available metadata, and entity type. The output is a structured verdict with explanation — not an autonomous compliance decision.
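As an illustration only, a Stage-4 verdict might carry fields like the following; the schema, field names, and verdict labels are assumptions, not Verifex's published output format:

```python
# Hypothetical shape of a structured adjudication verdict. Field names
# and verdict labels are illustrative assumptions, not Verifex's schema.
from dataclasses import dataclass, field

@dataclass
class AdjudicationVerdict:
    match_id: str
    verdict: str      # e.g. "LIKELY_MATCH", "LIKELY_NON_MATCH", "INCONCLUSIVE"
    rationale: str    # documented reasoning an analyst can audit
    evidence: list[str] = field(default_factory=list)  # metadata cues considered
```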

Confidence & thresholds

How scores become decisions

The matching pipeline produces a raw similarity score (0–100). This score is then passed through a penalty chain that adjusts for known failure modes: entity-type mismatch, common-name inflation, substring traps, generic business names, and surname-only queries.
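A hedged sketch of what such a penalty chain could look like; the penalty names mirror the failure modes above, but the magnitudes are invented for illustration:

```python
# Illustrative penalty chain. The failure modes mirror the ones listed
# above; the magnitudes are invented and are NOT Verifex's calibration.
def apply_penalties(raw_score: float, ctx: dict[str, bool]) -> float:
    score = raw_score
    if ctx.get("entity_type_mismatch"):    # person vs. organization
        score -= 25
    if ctx.get("common_name"):             # e.g. Mohammed, Kim, Smith
        score -= 15
    if ctx.get("substring_trap"):          # short query hiding inside a longer name
        score -= 20
    if ctx.get("generic_business_name"):   # "Global Trading Co" and similar
        score -= 15
    if ctx.get("surname_only_query"):
        score -= 10
    return max(0.0, min(100.0, score))     # clamp back into the 0-100 range
```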

The final confidence score maps to one of three recommendations (see the sketch after the list):

  • AUTO_CLEAR — Low confidence or penalized match. No human review required.
  • REVIEW — Moderate confidence. Match context provided for analyst review.
  • BLOCK — High confidence on a sanctions source. Immediate escalation recommended.
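A minimal sketch of that mapping; the cutoff values are placeholders, since the calibrated thresholds are not published here:

```python
# Hypothetical threshold mapping. REVIEW_MIN and BLOCK_MIN are
# placeholder values, not Verifex's calibrated cutoffs.
REVIEW_MIN = 60
BLOCK_MIN = 85

def recommend(confidence: float, source_is_sanctions: bool) -> str:
    if confidence >= BLOCK_MIN and source_is_sanctions:
        return "BLOCK"
    if confidence >= REVIEW_MIN:
        return "REVIEW"
    return "AUTO_CLEAR"
```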

Thresholds are currently calibrated against the published 122-case benchmark. They are not guaranteed to be optimal for every customer population. Enterprise customers can request threshold tuning against their own labeled data.

Limitations

Known limitations

The benchmark is vendor-authored and self-administered. It should not be read as an independent audit.

Production data has long-tail edge cases not captured by any finite benchmark. Customers should evaluate on representative production data.

Cross-script transliteration (Arabic, Cyrillic, Chinese) is improved but not solved. Some valid transliterations may be missed.

Common names (Mohammed, Kim, Smith) generate low-confidence near-matches that are intentionally tolerated to protect recall.

Entity-type mismatches (person vs. organization) are penalized but not eliminated.

Thresholds are calibrated against the published benchmark and may need adjustment for specific customer populations.