Designing a Convergent Event Deduplication Pipeline

Hotel demand often spikes because of what happens nearby. Concerts, sports fixtures, trade fairs, and festivals all move occupancy, so event data is a high-value forecasting signal.

The problem: each source describes the same real-world event a little differently. One concert can show up with three names, two venue spellings, and slightly shifted start times. If each variation is treated as a separate event, forecasting noise compounds quickly.

This post walks through a production-oriented deduplication design that solves three constraints at the same time:

  1. Merge accuracy: records that describe the same event should converge.
  2. ID continuity: canonical IDs should remain stable across runs.
  3. Computational tractability: candidate generation must aggressively shrink the search space so matching remains fast at scale.

If you only remember one thing, make it this: deduplication is not just about matching records in one run. It is about maintaining a durable identity layer that downstream systems can trust over time.

Why this is harder than it looks

Naive rules break almost immediately:

  • Exact name + date fails on harmless naming drift.
  • Venue-only matching fails when labels are inconsistent.
  • A single similarity threshold creates brittle, hard-to-debug decisions.

The more demanding requirement is continuity. Even when data changes between runs, consumers still need a stable canonical ID so historical features stay attached to the same entity.

Stable IDs are also a product requirement, not just a data-engineering one. Users can tune event weights and enable or disable specific events. If an event gets a new ID on the next run, those user choices no longer apply to the same real-world entity. The UX degrades immediately.

System model

The most useful mental model is a graph:

  • Nodes: raw event records from all sources.
  • Edges: likely matches between records.
  • Components: connected groups of matching records.
  • Canonical events: one entity per component.

With this framing, the pipeline stops being a pile of source-specific exceptions. It becomes one graph problem with explicit merge behavior.

Candidate generation without the combinatorial explosion

The first hard limit is computational: we cannot compare every event with every other event. At million-scale, the naive space is effectively quadratic and quickly becomes impractical, especially for pipelines that run multiple times per day.

We shrink the search space in two stages.

Stage 1: venue pre-pass (V^2)

Before event-level matching, we run a venue-distance pre-pass:

  • Compute pairwise distances for all venue pairs (V^2).
  • Keep only venue pairs within a max radius.
  • Build a venue-neighborhood index: for each venue, keep only nearby venues worth searching.

Importantly, this step is geospatial only: it relies on venue coordinates and distance thresholds, not venue-name similarity, tags, or other venue metadata.

At current scale, this pass performs well in production because it is vectorized (NumPy) and the venue universe grows much more slowly than event volume. If venue cardinality grows materially, a natural next optimization is geospatial bucketing (for example with H3) to avoid full pairwise distance evaluation.

Stage 2: date buckets (event-level filtering, per run)

After venue filtering, candidate volume is still too high, so we add a temporal constraint.

Define maxMergeStartGapDays as the maximum start-date difference (in days) that can still represent the same real-world event.

  • Use each event’s start date as the reference date.
  • Assign bucket IDs with fixed-width maxMergeStartGapDays buckets: bucket_id = floor((start_date - 2000-01-01) / maxMergeStartGapDays).
  • Compare each event only with events in:
    • the same bucket,
    • the previous bucket,
    • the next bucket.

The effective candidate set for one event is then:

  • events at nearby venues from the venue-neighborhood index,
  • intersected with events in {bucket_id - 1, bucket_id, bucket_id + 1}.

This two-stage blocking strategy is what makes the downstream graph merge step operationally feasible. It also gives the pipeline a near-linear growth profile over time: as new events arrive, they are compared only against nearby time buckets, not the full historical corpus.

Across the full candidate-generation pipeline (venue neighborhoods + date buckets), we end up with roughly ~2M candidate comparisons in practice. A naive all-vs-all pass at this scale would be roughly 2M^2 checks (about 4 * 10^12). That is about ~1 candidate pair per event on average, or roughly ~1,000,000x fewer checks than the naive baseline.

This write-up reports the combined effect; it does not attempt to separately attribute how much reduction comes from venue filtering vs. date filtering. The reduction is not uniform across geographies. In dense cities such as Paris, London, and Milan, many venues sit within short distance thresholds, so spatial filtering removes less and date bucketing does most of the work.

Merge process walkthrough

After candidate generation, the merge phase converts near-duplicate raw records into canonical clusters.

Pairwise comparison signals (briefly)

For each candidate pair, we compute a lightweight composite similarity score from a small set of signals:

  • start-date distance,
  • fuzzy name similarity,
  • tag/category overlap,
  • venue distance.

This score decides whether to add an edge between two records in the merge graph.

A simplified example:

  • Source records A, B, and C all describe the same Coldplay show with slightly different names, venue labels, and times.
  • These records become a connected component through pairwise links.
  • The component is promoted to one canonical event ID.
  • An unrelated record D remains in a separate component and keeps its own ID.

What matters most is not any single pair score, but the component shape that emerges once linked records connect. That structure turns local pairwise evidence into one canonical event.

ID preservation across runs

Merging records inside one run is only half the job. The other half is preserving IDs when sources change, new records arrive, or fields are updated.

A simplified run-to-run example:

  • Run N has canonical ID ID1 built from records A1, B1.
  • Run N+1 ingests updated records A2, B2, plus D2 from a new source.
  • The new merged component overlaps mostly with the prior Coldplay component.
  • It inherits ID1 instead of getting a new ID.

The key idea is a second pass that treats prior-run merged events and current-run merged events as two distinct sources. We apply the same comparison logic again, but only across sources (previous vs current), not within either source. These matches define continuity: current events inherit prior IDs, and redirect mappings preserve history when multiple prior IDs converge.

Design choices that improve convergence

A few additional choices consistently improve behavior:

  • Multi-signal edge construction: rely on combined evidence rather than a single field.
  • Component-first canonicalization: decide canonical entities from connected groups, not from local pair decisions only.
  • Explicit redirect history: persist old-to-new ID mappings so downstream systems can reconcile historical references.
  • Raw-record provenance retention: keep references to underlying source records so each merged event remains explainable.

Why this scales in production

This architecture handles growth better than ad hoc rule trees:

  • New sources contribute additional nodes and edges without requiring a complete rewrite of merge logic.
  • Run-to-run behavior stays explainable through component evolution.
  • Downstream consumers can treat canonical IDs as durable references instead of volatile artifacts.

The result is a deduplication pipeline that is both practical and stable: it absorbs imperfect multi-source data while preserving an identity layer that other systems can reliably depend on.

Future work

The core design is stable, but three extensions are high leverage:

  • Evaluation loop: maintain a labeled benchmark set and track false merges vs. missed merges, not only a single aggregate metric.
  • Scalability roadmap: move from full V^2 venue distance evaluation to geospatial indexing (for example H3) as venue cardinality grows.
  • Adaptive thresholding: tune merge thresholds by geography density to reduce over-merging risk in dense urban areas.
← Back