Entity Resolution in Practice: Deduplicating Real-World Procurement Data
While working on a Community Detection proof of concept with one of our clients, we kept running into the same companies showing up under different IDs. Same name, same address, same phone number, but the system treated them as completely separate entities. If you've ever worked with messy real-world data, you know the feeling. The duplicates were polluting our graph database and undermining the analysis.
We needed a way to catch these duplicates before they entered the pipeline. That's what led us to entity resolution, and specifically to Splink.
The data: EU public procurement records
We wanted to test our approach on real, messy data. The EU's Tenders Electronic Daily (TED) platform was a perfect candidate. TED publishes public procurement notices from across Europe as XML bulk downloads, including thousands of contract notices containing structured information about organizations: names, addresses, phone numbers, emails, websites, and country codes.
We parsed the XML files and extracted organization records into a flat table. The result: 106,971 rows across 10 columns: company ID, name, telephone, email, street, city, postal code, country subentity code, website, and identification code.
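For flavor, here's roughly what that extraction step looks like. A minimal sketch only: the element names below are placeholders rather than TED's actual schema, which nests organization details inside each notice.

```python
# Minimal extraction sketch. NOTE: the tag names are placeholders, not TED's
# real schema; the actual notices nest these fields more deeply.
import xml.etree.ElementTree as ET
from pathlib import Path
import pandas as pd

rows = []
for path in Path("ted_xml").glob("*.xml"):  # the downloaded notice files
    root = ET.parse(path).getroot()
    for org in root.iter("ORGANISATION"):  # placeholder element name
        rows.append({
            "name": org.findtext("OFFICIALNAME"),
            "telephone": org.findtext("PHONE"),
            "email": org.findtext("E_MAIL"),
            "city": org.findtext("TOWN"),
            "post_number": org.findtext("POSTAL_CODE"),
        })

df = pd.DataFrame(rows)
```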
It's the kind of dataset where the same organization can appear dozens of times with slight variations. A company might be listed as "Vergabestelle" in one notice and "Zentrale Vergabestelle" in another, with a slightly different email or a missing phone number. These aren't errors, exactly. They're just the natural messiness of data entered by different people at different times.
Why not just do exact matching?
The obvious first instinct is to write some SQL: match on name, or email, or phone number. If two records share the same value, they're the same company. Done.
Except it doesn't really work. Names have typos. Emails change. Phone numbers get reformatted. A rule-based approach is brittle: it either misses duplicates (too strict) or creates false matches (too loose). And it doesn't scale well. You end up writing an ever-growing pile of special cases and still never quite getting it right.
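To make the brittleness concrete, here's a toy illustration with made-up records:

```python
import pandas as pd

# Three rows for the same (made-up) company, entered slightly differently
df = pd.DataFrame({
    "name": ["Acme GmbH", "ACME GmbH", "Acme GmbH "],
    "email": ["info@acme.example", "info@acme.example", "contact@acme.example"],
})

# Exact matching on name finds zero duplicates (case and whitespace differ)
print(df.duplicated(subset=["name"]).sum())   # 0

# Exact matching on email catches one pair but misses the third record
print(df.duplicated(subset=["email"]).sum())  # 1
```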
What we needed was something that could look at multiple fields at once, tolerate fuzzy differences, and give us a probability rather than a binary yes/no.
Enter Splink
Splink is an open-source Python library built by the UK's Ministry of Justice for exactly this kind of problem. It does probabilistic record linkage, which is a fancy way of saying it uses statistics to figure out whether two records are likely referring to the same real-world entity, even when there's no shared unique identifier.
Under the hood, it implements the Fellegi-Sunter model. The core idea is intuitive: for each pair of records, you look at each field (name, email, city, etc.) and ask: given that these two records match on this field, how much more likely is it that they're truly the same entity?
The math boils down to:
posterior odds of match = prior odds × (m₁ × m₂ × ... × mₙ) / (u₁ × u₂ × ... × uₙ)
Where:
- Prior odds are the baseline probability that any two random records match
- m is how often a given comparison outcome happens when records do match
- u is how often it happens when they don't
If two records share the same rare email address, that's very strong evidence of a match (high m, low u). If they're both in "London", lots of non-matching records share that city, so it's weaker evidence. Splink learns these weights automatically from the data, no labeled training set required.
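To see how the evidence combines, here's a toy calculation. The m and u values are invented for illustration; they are not the weights Splink learned from our data:

```python
# Toy Fellegi-Sunter calculation; the m/u values are invented for illustration
prior_odds = 1 / 10_000  # baseline: roughly 1 true match per 10,000 random pairs

# For each observed outcome: (m, u) = P(outcome | match), P(outcome | non-match)
evidence = {
    "email agrees": (0.90, 0.0001),  # shared rare email: Bayes factor of 9,000
    "city agrees":  (0.95, 0.10),    # shared common city: factor of only 9.5
    "name differs": (0.20, 0.85),    # disagreement pushes the odds back down
}

posterior_odds = prior_odds
for outcome, (m, u) in evidence.items():
    posterior_odds *= m / u

match_probability = posterior_odds / (1 + posterior_odds)
print(f"{match_probability:.3f}")  # ~0.67
```

Despite the name mismatch, the rare shared email dominates and the pair still comes out as a likely match. That interplay between strong and weak signals is exactly what rule-based matching can't express.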
The scalability problem (and how blocking solves it)
Here's a practical issue: comparing every record against every other record means N×(N-1)/2 pairwise comparisons. For our dataset of ~107K records, that's already roughly 5.7 billion pairs. For a million records? You're looking at 500 billion comparisons. Your laptop would catch fire.
Splink handles this with blocking, a technique that pre-filters which pairs are even worth comparing. Instead of comparing every record to every other, you only compare records that share at least one attribute value. In our case, we set up blocking rules on telephone, email, and city:
```python
blocking_rules_to_generate_predictions=[
    block_on("telephone"),
    block_on("email"),
    block_on("city"),
],
```
Under the hood, this translates to something like:
```sql
SELECT *
FROM df AS l
JOIN df AS r
  ON l.telephone = r.telephone
  OR l.email = r.email
  OR l.city = r.city
```
This dramatically cuts down the number of comparisons while still catching the vast majority of true matches.
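You can get a feel for this from the data itself: for a single blocking key, the candidate pairs are just the within-group pairs summed over groups. A quick pandas sketch (df being our extracted table):

```python
# Rough count of candidate pairs generated by blocking on a single column
def pairs_from_blocking_key(df, col):
    group_sizes = df.dropna(subset=[col]).groupby(col).size()
    return int((group_sizes * (group_sizes - 1) // 2).sum())

for col in ["telephone", "email", "city"]:
    print(col, pairs_from_blocking_key(df, col))
```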
Configuring comparisons
Once Splink has a set of candidate pairs, it needs to know how to compare them. This is where you define comparison rules for each column, and it's one of the parts we spent the most time tuning.
Here's what we configured:
| Column | Comparison Type | What it does |
|---|---|---|
| name | DistanceFunctionAtThresholds | Levenshtein distance at 0.7, 0.8, 0.9. Captures typos and abbreviations |
| email | EmailComparison | Purpose-built for email matching, handles format variations |
| website | DistanceFunctionAtThresholds | Levenshtein at 0.8 and 0.9. Catches example.com vs www.example.com |
| city | ExactMatch (with term frequency) | Exact match, but weighs rare cities higher than common ones |
| post_number | PostcodeComparison | Specialized postal code comparison, handles formatting differences |
| identification_code | ExactMatch | Country code, must match exactly |
The term frequency adjustment on city is worth highlighting. Without it, two records both listing "Berlin" would get the same match score as two records both listing some tiny village. With term frequency adjustment, the rare match gets more weight because sharing an uncommon city is much stronger evidence of a true match.
After comparisons, the Fellegi-Sunter model assigns weights and combines them into an overall match probability for each pair.
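Putting it all together, here's roughly what such a configuration looks like. This is a sketch against Splink 4's API, not our exact setup: the thresholds and training rules are illustrative, and JaroWinklerAtThresholds stands in for the DistanceFunctionAtThresholds configuration from the table above.

```python
import splink.comparison_library as cl
from splink import DuckDBAPI, Linker, SettingsCreator, block_on

settings = SettingsCreator(
    link_type="dedupe_only",
    blocking_rules_to_generate_predictions=[
        block_on("telephone"),
        block_on("email"),
        block_on("city"),
    ],
    comparisons=[
        # Stand-in for the DistanceFunctionAtThresholds config on name/website
        cl.JaroWinklerAtThresholds("name", [0.9, 0.8, 0.7]),
        cl.EmailComparison("email"),
        cl.JaroWinklerAtThresholds("website", [0.9, 0.8]),
        cl.ExactMatch("city").configure(term_frequency_adjustments=True),
        cl.PostcodeComparison("post_number"),
        cl.ExactMatch("identification_code"),
    ],
)

linker = Linker(df, settings, db_api=DuckDBAPI())  # df: the extracted table

# Learn u from random pairs, m via expectation-maximisation (no labels needed)
linker.training.estimate_u_using_random_sampling(max_pairs=1e7)
linker.training.estimate_parameters_using_expectation_maximisation(block_on("name"))

pairwise_predictions = linker.inference.predict(threshold_match_probability=0.9)
```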
Results: clusters and purity
Splink doesn't just give you pairs. It groups matched records into clusters, where each cluster represents what the model believes is a single real-world entity. Think of it as connected components in a graph: if record A matches B, and B matches C, then A, B, and C all end up in the same cluster even if A and C weren't directly compared.
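Continuing the sketch from the previous section, that clustering step is a single call (the 0.95 cut-off is illustrative):

```python
# Connected-components clustering over the pairwise predictions
clusters = linker.clustering.cluster_pairwise_predictions_at_threshold(
    pairwise_predictions, threshold_match_probability=0.95  # illustrative cut-off
)
df_clusters = clusters.as_pandas_dataframe()  # one row per record, with a cluster_id
```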
Our run produced 30,462 clusters from the ~107K records (the original data contained 29,964 unique company IDs). To evaluate how well the model performed, we used cluster purity, a measure of how many records in each cluster belong to the same true entity.
Purity of a cluster = (records belonging to the majority class) / (total records in the cluster)
The results:
- Total cluster purity: 0.88
- 1,693 clusters had purity below 100%, meaning they contained records from more than one company
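Computing purity takes only a few lines of pandas. A sketch, assuming df_clusters carries each record's cluster_id alongside its source company_id:

```python
# Purity per cluster: share of records carrying the cluster's majority company ID
purity = df_clusters.groupby("cluster_id")["company_id"].agg(
    lambda ids: ids.value_counts().iloc[0] / len(ids)
)

print(purity.mean())         # unweighted mean purity across clusters
print((purity < 1.0).sum())  # clusters that mix more than one company
```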
Not bad for an unsupervised model with no training data. But we wanted to dig deeper.
Digging into the results
We looked at clusters on both ends of the purity spectrum.
The good: Cluster 4716 (100% purity). This cluster correctly grouped dozens of records for the same German federal organization (company ID 994-DOEVD-83), all appearing under slightly different department names like "Vergabestelle," "Zentrale Vergabestelle," "Einkauf Wissenschaftliche Geräte," etc. Different names, same phone number, same city. Splink nailed it.
The imperfect: Cluster 1 (67% purity). This cluster grouped six records, five with company ID 1036.E00341.0001 and one with 090101655. The records were nearly identical: same name, same phone, same email, same address, same city. The only difference was the company ID itself. This looks like a data quality issue in the source, not a Splink mistake. The model did exactly what you'd want: it said "these records clearly refer to the same entity."
A subtlety: companies spanning multiple clusters
We also checked for the reverse problem: a single company ID appearing across multiple clusters. We found 1,568 company IDs split across more than one cluster. The total purity measured by company ID (rather than cluster ID) came out to 0.90.
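The check in this direction is symmetric; sketched against the same hypothetical df_clusters frame:

```python
# Companies whose records land in more than one cluster
clusters_per_company = df_clusters.groupby("company_id")["cluster_id"].nunique()
print((clusters_per_company > 1).sum())  # 1,568 in our run
```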
Why does this happen? Consider company 37836302645. It appeared in three records across two clusters. Two records had the email [email protected] and the website https://min-kulture.gov.hr/. The third had a different email ([email protected]) and a different website format (http://www.min-kulture.hr). The phone number also differed slightly. Enough variation across enough fields that Splink, reasonably, wasn't confident enough to merge them.
This is the kind of thing you'd need additional business logic or manual review to resolve.
How does Splink compare to simpler approaches?
We benchmarked Splink against three naive baselines: clustering purely by website, by email, or by name. Here's how they stacked up:
| Metric | Website | Name | Email | Splink |
|---|---|---|---|---|
| Number of clusters | 14,101 | 30,953 | 32,067 | 30,462 |
| Purity (cluster ID) | 0.87 | 0.85 | 0.90 | 0.88 |
| Purity (company ID) | 0.87 | 0.86 | 0.85 | 0.90 |
| Avg purity | 0.87 | 0.85 | 0.87 | 0.89 |
A few takeaways:
- Email is a strong signal for grouping. Records sharing an email almost certainly belong to the same company. But companies don't always use the same email across notices, so email-only clustering misses matches.
- Website matching yields fewer clusters but introduces more errors, since URL formatting varies widely.
- Splink outperforms on average by combining all signals together. Its 0.89 average purity beats every single-field approach. More importantly, its company-ID purity (0.90) is the highest, meaning it's the best at keeping each real company in a single cluster.
What we learned
A few honest reflections from this experiment:
Splink works remarkably well out of the box. For an unsupervised approach with no labeled data, getting 0.89 average purity on messy, multilingual EU procurement data is solid. The Fellegi-Sunter model, combined with smart blocking and flexible comparison functions, handles real-world messiness far better than any rule-based approach we could have written by hand.
Configuration matters. The choice of blocking rules, comparison functions, and thresholds had a real impact on results. We spent time tuning Levenshtein thresholds for names and websites, choosing the right comparison type for emails and postcodes, and deciding which fields to use for blocking. This isn't a "plug and play" tool. You need to understand your data.
No model is perfect. The 1,568 companies split across multiple clusters remind us that probabilistic matching has limits. When the same entity appears with genuinely different contact details across records, even a good model will struggle. Post-processing rules or human review may be needed for high-stakes applications.
The data itself is the hardest part. Parsing XML, cleaning fields, handling multilingual text, normalizing phone numbers. This unglamorous work took more time than configuring Splink itself. As always, the quality of your input determines the quality of your output.