What is a good false positive rate for sanctions screening?

Industry benchmarks vary, but most mature compliance programs aim to reduce false positives without lowering recall. Legacy systems often produce 95%+ false positives. Verifex's current published 500-case benchmark (Engine v3) reports 100% precision and 0 false positives, with the 3 remaining false negatives disclosed publicly.

How does IDF weighting reduce false positives for common names?

IDF (Inverse Document Frequency) weighting assigns lower significance to common name tokens like 'Mohammed' or 'Ali' and higher significance to rare tokens. This means a match on a rare surname contributes far more to the confidence score than a match on a common first name, dramatically reducing false alerts for common names.

Can AI completely eliminate false positives in sanctions screening?

No. AI verification significantly reduces false positives by cross-referencing contextual data like nationality, date of birth, and known aliases. However, some ambiguous cases will always require human review. The goal is to reduce the volume of manual reviews to a manageable level while maintaining high recall. Verifex's 500-case benchmark reports 99.36% recall with 3 documented false negatives.

What is the difference between precision and recall in sanctions screening?

Precision measures how many of your flagged alerts are actual matches (higher precision means fewer false positives). Recall measures how many actual sanctioned entities your system catches (higher recall means fewer missed matches). The challenge is balancing both — reducing false positives without letting real matches slip through.

Back to Blog

Engineering

March 26, 20269 min read

How to Reduce False Positives in Sanctions Screening — A Technical Guide

If you run a sanctions screening system in production, you already know the biggest operational problem is not missed matches. It is the overwhelming volume of false positives. Industry data consistently shows that 95% or more of sanctions alerts are false positives — flags that trigger manual review but turn out to be innocent customers with names that happen to resemble a sanctioned entity.

For a compliance team processing 10,000 screens per day, that means 9,500+ alerts that waste analyst time, slow down customer onboarding, and erode trust in the screening system. When analysts spend all day dismissing false alerts, they start rubber-stamping reviews, which is exactly how real matches get missed.

This article explains the technical approaches that reduce false positives while maintaining high recall, from algorithmic improvements like IDF weighting to structured adjudication with documented rationale pipelines. We will use concrete examples and real numbers from Verifex's internal benchmark to show what is achievable.

Why simple string matching produces so many false positives

The root cause of the false positive problem is that most sanctions screening systems rely on basic string similarity. They compute a Levenshtein distance or Jaro-Winkler score between the query name and every entry in the sanctions database, then flag anything above a threshold.

This approach has a fundamental flaw: it treats every character in the name as equally important. Consider screening the name "Mohammed Ali Hassan". The token "Mohammed" appears in thousands of sanctions entries because it is one of the most common names in the world. "Ali" is similarly frequent. A basic fuzzy matcher will flag dozens of entries because two out of three name tokens match, even though the third token and all contextual details are completely different.

This is sometimes called the "Mohammed problem" in compliance circles, but it applies to any common name. "David Kim," "Carlos Garcia," "Wang Wei" — all of these generate disproportionate false positives because the name tokens are so frequent in the global population that random matches are inevitable.

The solution is not to lower your matching threshold. That would reduce false positives but also increase false negatives — missed matches on actual sanctioned entities. Instead, you need smarter matching algorithms that understand which parts of a name are distinctive and which are not.

Multi-algorithm matching: fuzzy, phonetic, and beyond

The first step toward reducing false positives is using multiple matching algorithms in a pipeline, rather than relying on a single similarity metric. Each algorithm catches different types of matches and has different false positive characteristics.

Fuzzy matching (Levenshtein distance) computes the edit distance between two strings — the number of character insertions, deletions, or substitutions needed to transform one into the other. This catches typos, minor misspellings, and small transliteration differences. A query of "Aleksadr Petrov" matches "Aleksandr Petrov" with high confidence because only one character differs.

Phonetic matching (Soundex, Metaphone) converts names into phonetic codes based on pronunciation rather than spelling. "Schmidt" and "Smith" have very different spellings but identical Soundex codes (S530). This catches transliterations from non-Latin scripts where the same name can be spelled many different ways in English.

Token-based matching breaks names into individual tokens and compares them independently, handling different name orderings. "Petrov, Aleksandr Sergeevich" and "Aleksandr Petrov" share key tokens even though the full strings differ significantly. This is critical because sanctions lists often use different name ordering conventions than customer input forms.

Running all three in sequence with short-circuiting — exact match first, then fuzzy, then phonetic — gives you broad coverage. But coverage alone does not solve false positives. You need a way to weight the results intelligently.

How IDF weighting solves the common name problem

Inverse Document Frequency (IDF) is a concept borrowed from information retrieval. The core idea is simple: tokens that appear in many documents (or in this case, many sanctions entries) are less informative than tokens that appear rarely.

The IDF weight of a token is calculated as:

IDF(token) = log(N / df(token))

Where:
  N  = total number of entries in the sanctions database
  df = number of entries containing that token

For a sanctions database with 30,000 entries, the numbers look something like this:

"Mohammed" appears in 2,400 entries. IDF = log(30000/2400) = 2.53. Low weight — not distinctive.
"Ali" appears in 1,800 entries. IDF = log(30000/1800) = 2.81. Still low weight.
"Qadhafi" appears in 3 entries. IDF = log(30000/3) = 9.21. Very high weight — extremely distinctive.

When scoring a match, instead of giving equal weight to every token, you multiply each token's similarity score by its IDF weight. A match on "Qadhafi" contributes 3-4x more to the final confidence score than a match on "Mohammed."

This has a dramatic effect on false positives. Consider two screening scenarios:

Without IDF weighting: Screening "Mohammed Ali Hassan" matches "Mohammed Ali al-Houthi" at 72% confidence because two out of three tokens match. This triggers a manual review.

With IDF weighting: The same match scores only 38% because "Mohammed" (IDF 2.53) and "Ali" (IDF 2.81) contribute very little to the score, while the mismatched "Hassan" vs "al-Houthi" — both with moderate IDF — pull the score down significantly. The alert is not triggered.

Meanwhile, screening "Muammar Qadhafi" still matches "Muammar al-Qadhafi" at 94% confidence because "Qadhafi" (IDF 9.21) dominates the score, and "Muammar" (IDF 7.8) also has a high weight since it is rare. The true match is preserved.

In our technical deep-dive on fuzzy matching, we explain how Verifex combines Levenshtein distance with IDF weighting in the scoring pipeline.

Structured adjudication for ambiguous matches

Even with IDF weighting, some matches remain genuinely ambiguous. The name tokens match with moderate confidence, the IDF weights are in an inconclusive range, and you cannot automatically approve or reject the alert. This is where structured adjudication with documented rationale adds the most value.

The approach is a cascade architecture: fast algorithmic matching handles the clear cases (exact matches and obvious non-matches), and an optional assisted review stage is invoked only for the ambiguous middle band — typically 5-15% of all screenings. The output is a documented rationale, not an autonomous decision.

When a match falls in the ambiguous zone (for example, confidence between 55% and 80%), the system sends the match context to an LLM along with all available metadata:

The query name and the matched sanctions entry name
All known aliases of the sanctions entry
Date of birth, nationality, and address of the sanctions entry
The customer-provided metadata (DOB, nationality, passport number if available)
The entity type (person vs. organization)

The LLM evaluates whether the match is plausible given all available context. It can reason about things that pure string matching cannot: "The sanctioned entity is a 68-year-old Iranian national, but the customer is a 29-year-old Canadian citizen. Despite the name similarity, these are almost certainly different people."

This cascade design keeps costs low because the LLM is only invoked for a small fraction of screenings. The vast majority are resolved by the fast algorithmic layer with low latency. The LLM adds 1-3 seconds for ambiguous cases, which is acceptable since these would have gone to manual review anyway.

Verifex's benchmark results

We publish our matching accuracy numbers openly because we believe transparency builds trust in compliance tooling. On our internal benchmark, Verifex achieves:

F1 Score: 99.68% — the harmonic mean of precision and recall, measuring overall matching quality
Precision: 100% — the latest verified run shows 0 false positives on the 500-case benchmark
Recall: 99.36% — of all actual sanctioned entities in the published test set, 99.36% were correctly identified

Compare this to the industry average where 95%+ of alerts are false positives — that is precision of roughly 5%. Verifex's latest published precision is 100%, which is materially better than legacy tooling. The 3 remaining false negatives are documented and inspectable.

The critical number in this benchmark is recall at 99.36%. Reducing false positives is only valuable if you are not simultaneously letting real matches slip through. A system that flags nobody has zero false positives and zero value. High recall ensures that when a sanctioned entity is screened, it gets caught.

Practical tips for reducing false positives

Beyond the algorithmic improvements described above, there are practical steps any compliance team can take to reduce false positive volume:

1. Use entity type filtering. If you know the input is a person, do not match against vessel or aircraft entries. If it is a company, do not match against individual entries. This alone can cut false positives by 15-20%.

2. Collect and use date of birth. Many sanctions entries include DOB information. If your customer's DOB is 1995 and the matched entry's DOB is 1952, you can safely downgrade the alert. Verifex allows you to pass DOB as an optional parameter to enable this filtering automatically.

3. Use nationality and country of residence. A customer based in Norway with Norwegian citizenship matching against a sanctioned individual from Syria should be scored lower than a customer from Syria matching the same entry.

4. Set confidence thresholds by risk tier. Not all customers carry the same risk. For a standard retail customer, you might auto-approve below 70% confidence. For a customer from a high-risk jurisdiction or dealing in large transaction volumes, you might set the threshold at 50%.

5. Maintain a confirmed false positive whitelist. Once an analyst has confirmed a match is a false positive, record that decision so the same customer does not trigger the same alert during re-screening. Include the specific sanctions entry ID in the whitelist, not just the customer name, so new sanctions entries still get flagged.

6. Screen with full legal names. Screening "M. Hassan" will produce far more false positives than "Mohammed Ibrahim Hassan." Collect full names at onboarding and use them for screening. Nicknames and abbreviations should be screened as secondary queries, not primary.

How to measure your false positive rate

You cannot improve what you do not measure. Here is how to calculate your false positive rate and track it over time:

False Positive Rate (FPR) = False Positives / (False Positives + True Negatives). This tells you what percentage of innocent customers get incorrectly flagged.

Precision = True Positives / (True Positives + False Positives). This tells you what percentage of your alerts are actually worth investigating.

To calculate these, you need labeled data. Take a random sample of 200-500 alerts from the past month, have an analyst classify each as true positive or false positive, and compute the ratios. Repeat this monthly to track trends.

If your precision is below 10% (meaning more than 90% of alerts are false), you have a serious problem that is costing your team time and risking compliance fatigue. If your precision is above 50%, you are in strong shape. Above 80% — where Verifex operates — your analysts spend most of their time on genuinely suspicious matches.

You should also track alert volume per 1,000 screens. If you are generating 50 alerts per 1,000 screens and your customer base is mostly low-risk individuals from low-risk jurisdictions, your matching is probably too aggressive. A well-tuned system typically generates 5-15 alerts per 1,000 screens for a standard fintech customer base.

Choosing the right screening provider

If you are evaluating sanctions screening APIs, false positive rate should be a primary selection criterion — not just list coverage or price per screen. A provider that costs $0.01 per screen but generates 10x more false positives than a flat-rate provider will cost you far more in analyst time and delayed onboarding.

Ask prospective providers for their published benchmark numbers. If they do not have any, that is a red flag. Check our sanctions screening API comparison for a detailed breakdown of how different providers handle matching quality and false positive reduction.

Summary

Reducing false positives in sanctions screening requires a multi-layered approach. Simple string matching will never achieve acceptable precision because it treats all name tokens as equally important. IDF weighting solves the common name problem by down-weighting frequent tokens. Structured adjudication with documented rationale handles the ambiguous middle band that algorithmic approaches cannot resolve confidently.

The practical steps matter just as much as the algorithms: filter by entity type, use DOB and nationality, set risk-appropriate thresholds, maintain a whitelist, and measure your false positive rate regularly. With the right combination of smart matching and operational discipline, you can bring your false positive rate from 95% to under 20% without sacrificing recall on genuine matches.

This guide is for technical and operational education. Verifex provides screening infrastructure and evidence records, not legal advice, transaction approval, or a replacement for your risk-based compliance program.

Get started with Verifex

Screen against OFAC, UN & UK sanctions lists in one API call. Free tier available.

Start screening free