Building a Convergent Event Deduplication Pipeline with Graph Algorithms

Hotel demand is driven by what’s happening around a property, and one of the strongest signals is large public events: concerts, sports fixtures, trade shows, festivals. At Smartpricing we pull that signal from a number of third-party feeds and open data sources, which is where the interesting problem starts. The same event - say, a single concert on a single night - shows up in three or four of those feeds, with different names, different venue spellings, slightly different start times, and occasionally a date that’s off by a day because of timezone confusion. We needed one stable identifier per real-world event, one that survives across runs, so downstream pricing models could attach features to it without the identifier shifting underneath them.

The problem

Naive deduplication - group by exact name and date - fails immediately. “U2: Achtung Baby Tour” and “U2 - Achtung Baby” are the same concert. A match on venue is tempting, but venue names drift too: “Arena Verona”, “Verona Arena”, and “Arena di Verona” are all one place. Geographic coordinates help, but only up to the precision each source provides, and some sources geocode to the city rather than the venue.
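To make the failure mode concrete, here is a minimal sketch (illustrative, not our production normalizer) showing that even after normalization those names don’t match exactly, while a token-sort similarity scores them high:

```python
import re
from difflib import SequenceMatcher

def normalize(name: str) -> str:
    """Lowercase, strip punctuation, collapse whitespace."""
    name = re.sub(r"[^\w\s]", " ", name.lower())
    return " ".join(name.split())

def token_sort_similarity(a: str, b: str) -> float:
    """Compare with tokens sorted, so word order and extra words matter less."""
    ta = " ".join(sorted(normalize(a).split()))
    tb = " ".join(sorted(normalize(b).split()))
    return SequenceMatcher(None, ta, tb).ratio()

a, b = "U2: Achtung Baby Tour", "U2 - Achtung Baby"
print(normalize(a) == normalize(b))           # False: exact match still misses it
print(round(token_sort_similarity(a, b), 2))  # 0.86: similarity catches it
```

The similarity score is promising, but as the rest of this section argues, picking a single global threshold for it is exactly where things go wrong.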

The harder constraint is stability. If yesterday we assigned an event the identifier evt_a7f3, we really want today’s run to arrive at the same identifier even if a new source appears, an old source drops out, or a field changes slightly. Pricing models that were trained with evt_a7f3 as a feature should continue to work. That ruled out anything that assigned IDs by, say, hashing the full input set.

Why naive deduplication fails

The first instinct is to write a cascade of rules: try exact match on normalized name and date; fall back to fuzzy match with a threshold; then venue match, then geographic match. In practice this produces a system that is impossible to reason about. Two sources can agree strongly on three attributes and disagree on a fourth, and the order in which rules fire determines the outcome. The edge cases compound, and every new source adds a dimension of breakage.

The second instinct is a big embedding model with a cosine similarity threshold. It works on benchmarks and falls apart in production: a threshold that merges “Rolling Stones Verona” with “The Rolling Stones at Arena di Verona” will also happily merge two different tribute bands playing on consecutive nights. The threshold has to be set per source pair, and then you are back to rule cascades.

The approach

I framed deduplication as graph clustering. Nodes are raw event records from each source. Edges connect records that look similar by a pairwise scoring function. The clusters are the connected components of that graph, and each cluster becomes a canonical event.

To keep the graph small I first do a coarse geospatial blocking step. Events get bucketed by an H3 cell at a resolution that comfortably contains a single venue plus some slack. We only consider edges between records in the same cell or in a neighboring cell, which drops the candidate pairs from quadratic in the number of events to something much more manageable. Within each block we run a comparison that mixes normalized-string similarity on name and venue, date-and-time distance, and a cheap category check.
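Here is a sketch of the blocking and scoring steps. To keep it dependency-free, a plain lat/lng grid stands in for H3 (the idea is the same: bucket each record into a cell and only compare within a cell and its neighbors), and the cell size, score weights, and 0.8 edge threshold are illustrative values, not our production configuration:

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime
from difflib import SequenceMatcher
from itertools import combinations

CELL_DEG = 0.01  # ~1 km grid cell; a venue-scale H3 resolution plays this role

@dataclass(frozen=True)
class Record:
    source: str
    name: str
    venue: str
    start: datetime
    lat: float
    lng: float

def cell(r: Record) -> tuple[int, int]:
    """Coarse blocking key: the grid cell containing the record."""
    return (int(r.lat // CELL_DEG), int(r.lng // CELL_DEG))

def neighborhood(c: tuple[int, int]) -> list[tuple[int, int]]:
    """The cell itself plus its eight neighbors."""
    x, y = c
    return [(x + dx, y + dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1)]

def sim(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def score(a: Record, b: Record) -> float:
    """Pairwise score mixing name similarity, venue similarity, time distance."""
    hours = abs((a.start - b.start).total_seconds()) / 3600
    time_score = max(0.0, 1.0 - hours / 24)  # decays to zero beyond a day apart
    return 0.5 * sim(a.name, b.name) + 0.3 * sim(a.venue, b.venue) + 0.2 * time_score

def candidate_pairs(records: list[Record]) -> set[tuple[int, int]]:
    """Only pairs in the same cell or adjacent cells are ever scored."""
    buckets: dict[tuple[int, int], list[int]] = defaultdict(list)
    for i, r in enumerate(records):
        buckets[cell(r)].append(i)
    pairs: set[tuple[int, int]] = set()
    for c in list(buckets):
        pool = sorted({i for n in neighborhood(c) for i in buckets.get(n, [])})
        pairs.update(combinations(pool, 2))
    return pairs

records = [
    Record("feed_a", "U2: Achtung Baby Tour", "Arena di Verona",
           datetime(2024, 7, 1, 21, 0), 45.4389, 10.9941),
    Record("feed_b", "U2 Achtung Baby", "Arena Verona",
           datetime(2024, 7, 1, 20, 30), 45.4391, 10.9940),
]
edges = [(i, j) for i, j in candidate_pairs(records)
         if score(records[i], records[j]) >= 0.8]
# the two feed records end up connected by an edge
```

Swapping the grid for real H3 cells changes only `cell` and `neighborhood`; the scoring and everything downstream stay the same.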

Taking the connected components of the resulting graph gives us the deduplicated event set. There is no threshold tuning per source pair, because the graph doesn’t care which source an edge came from; it only cares whether the edge exists. Adding a new source means adding new nodes and edges, and the components rearrange themselves.
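The component step itself is small. Here is a union-find sketch (a sketch, not our production code) that takes the edge list from the scoring pass and returns clusters of record indices:

```python
def connected_components(n: int, edges) -> list[set[int]]:
    """Union-find over n record indices; each returned set is one cluster."""
    parent = list(range(n))

    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)

    clusters: dict[int, set[int]] = {}
    for i in range(n):
        clusters.setdefault(find(i), set()).add(i)
    return list(clusters.values())

# five records, edges 0-1-2 and 3-4 -> two canonical events
components = connected_components(5, [(0, 1), (1, 2), (3, 4)])
```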

Convergent deduplication

The stability requirement was the most interesting part. The trick is to treat the identifier assignment as a fixed point over runs rather than as a fresh computation. On each run I take the current set of raw records, compute the new components, and then match those components against the previous run’s components by maximum overlap of contributing records. A component that overlaps an old one by more than a threshold inherits the old identifier; a genuinely new component gets a fresh one.

This gives the pipeline its convergent property: if the input is stable, the identifiers are stable. If a source adds a field that tightens an edge, two previously separate components can merge, and the larger of the two keeps its identifier while the other is redirected to it. The redirection is logged so downstream consumers can reconcile any features they attached to the retired ID.
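A sketch of that matching step, assuming clusters are sets of stable per-source record keys and using a 50% overlap threshold (both illustrative); `fresh_id` is a hypothetical ID factory, not our real one:

```python
from itertools import count

_counter = count(start=1)

def fresh_id() -> str:
    # hypothetical ID factory; production would want something collision-safe
    return f"evt_new_{next(_counter)}"

def assign_ids(new_clusters, prev_assignment, threshold=0.5):
    """
    new_clusters: list of sets of record keys from this run's components.
    prev_assignment: dict of event ID -> set of record keys from the last run.
    Returns (assignment, redirects): ID -> cluster, retired ID -> surviving ID.
    """
    assignment, redirects = {}, {}
    for cluster in new_clusters:
        # previous components whose records sufficiently overlap this cluster
        overlapping = [
            (evt_id, prev)
            for evt_id, prev in prev_assignment.items()
            if len(cluster & prev) / len(prev) >= threshold
        ]
        if not overlapping:
            assignment[fresh_id()] = cluster  # genuinely new event
            continue
        # when two old components merged, the larger one keeps its identifier
        overlapping.sort(key=lambda item: len(item[1]), reverse=True)
        keep = overlapping[0][0]
        assignment[keep] = cluster
        for retired, _ in overlapping[1:]:
            redirects[retired] = keep  # logged for downstream reconciliation
    return assignment, redirects

prev = {
    "evt_a7f3": {("feed_a", "123"), ("feed_b", "987")},
    "evt_c1d9": {("feed_c", "555")},
}
merged = {("feed_a", "123"), ("feed_b", "987"), ("feed_c", "555")}
assignment, redirects = assign_ids([merged], prev)
# evt_a7f3 survives (it was the larger component); evt_c1d9 redirects to it
```

If the inputs don’t change between runs, every cluster overlaps its predecessor completely and `assign_ids` is the identity on identifiers, which is the fixed-point behavior described above.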

A diagram of this - raw records flowing into H3 cells, pairs being scored, components being computed, and prior-run IDs being assigned - is something I plan to add to this post as an SVG once I’ve drawn it.

What I’d change

The biggest regret is not writing the pairwise scorer with evaluation in mind from day one. The system works, but the only way to tell whether a scoring change is an improvement is to run it end-to-end and look at the diff. Next time I’d maintain a labelled evaluation set of true-positive and true-negative pairs and gate changes on it.

The second thing is that H3 is convenient but not magic. A handful of city-center venues sit near cell boundaries, and events there occasionally miss each other across runs. Adding a second blocking pass on a shifted cell grid would close that gap cheaply.
