Achieving ~100x Compression on Scraped Pricing Data in ScyllaDB

At Smartness we continuously scrape competitor prices from major booking platforms so our pricing models can see the market the way a revenue manager would. The raw volume of that feed is unreasonable: tens of millions of rooms, sampled many times a day, across many occupancies, lengths of stay, and rate plans. A naive storage layout would balloon into tens of terabytes within months. This post walks through how we landed on a design that keeps the full history online in roughly 300 GB on a small ScyllaDB cluster. By our estimate that is on the order of 100× smaller than the corresponding raw JSON, with about 20× between uncompressed logical size and compressed on-disk data inside ScyllaDB.

Disk footprint was not the whole story: we also needed downstream jobs to turn stored curves into usable numeric form without drowning in parsing and reshaping. The write-up foregrounds compressed storage for clarity, but the representation choices were threaded with that CPU path in mind.

The problem

The scraper fans out against a large catalog of properties. For each property we receive a matrix of prices keyed by check-in date, length of stay, and rate plan, repeated every few hours. In the early days we stored each scrape as one row per observation in Cassandra and quickly ran into two problems: batch writes were too big and read queries were too slow.

We also needed the full history to stay queryable. Analysts and model training jobs regularly ask for “all observations for property X between dates Y and Z”, and the answer should come back in seconds rather than minutes. Throwing old data into cold storage was not acceptable.

Storage was growing at an alarming rate, about 1 GB per day, even though we were still scraping at a limited scale. Cassandra compresses data on disk, but the row-per-observation format was a poor fit for the standard compression chunk sizes.

It wasn’t only disk: processing throughput counted too. Serving training jobs and dashboards from chunky JSON repeatedly paid for parse time, allocations, and shuffling rows into analytic structures—we wanted ingestion of a forward curve to stay cheap enough that I/O stayed the bottleneck, not the CPU path leading up to it.

One representative task: resolving a competitor reference quote for a fixed check-in is not a naive “latest observation wins.” Inventories hit sell-outs, quotes disappear when the accommodation is effectively full, and when the observation timestamp sits too close to the stay date, fares often bear near-date distortions we prefer not to learn from. Practical code still ends up sweeping time: find the newest scrape where a rate exists yet still falls safely ahead of arrival relative to modeling policy.

Doing that at scale over verbose JSON payloads would eat the budget before the actual pricing logic ran.
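The sweep described above can be sketched in a few lines. This is a minimal illustration, not our production code: the Scrape shape, the reference_quote name, the use of None for a missing or sold-out quote, and the min_lead policy parameter are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import date, datetime, timedelta
from typing import Optional, Sequence


@dataclass(frozen=True)
class Scrape:
    """One observed forward curve (hypothetical in-memory shape)."""
    observed_at: datetime               # naive timestamps, assumed to share one time zone
    first_checkin: date                 # calendar day that prices[0] refers to
    prices: Sequence[Optional[float]]   # None marks a missing or sold-out quote


def reference_quote(scrapes: Sequence[Scrape], checkin: date,
                    min_lead: timedelta) -> Optional[float]:
    """Return the newest usable price for `checkin`, or None if there is none.

    `scrapes` must be ordered newest first. A scrape is usable when it was
    observed at least `min_lead` before the start of the check-in day and
    it actually carries a quote for that day.
    """
    cutoff = datetime.combine(checkin, datetime.min.time()) - min_lead
    for scrape in scrapes:
        if scrape.observed_at > cutoff:
            continue  # too close to arrival: skip near-date distortions
        index = (checkin - scrape.first_checkin).days
        if 0 <= index < len(scrape.prices) and scrape.prices[index] is not None:
            return scrape.prices[index]
    return None
```

Each step of the loop is an index computation and an array lookup, which is only cheap if the curve is already in array form when it reaches this code.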

A managed Cassandra service also limited our tuning options: we could not access nodetool to check compression ratios, try different compaction strategies (without waiting weeks for compactions), or change compression algorithms.

Why ScyllaDB

The workload is write-heavy, partitioned naturally by property, and accessed as a time series. Compression was a strict requirement. That maps cleanly onto a wide-column store. We evaluated Cassandra and ScyllaDB; ScyllaDB’s lower operational footprint and the shard-per-core architecture made it an easy choice for a small team. A three-node cluster on modest hardware comfortably absorbs the write rate with headroom to spare. We compared the performance of the previous managed service (Astra) with ScyllaDB Cloud, and observed 3-10x higher throughput on ScyllaDB.

Side note: Astra stopped offering Cassandra shortly after, so we likely would have migrated to ScyllaDB anyway.

The compression approach

Before tuning ZStd and compaction, we tried three ways to represent the same scraped forward curve for a fixed property–room–LOS–occupancy slice.

Fourier. We took the next several hundred days of prices as a sampled signal and stored a frequency-domain form instead of day-by-day samples. That helps when a handful of modes explain most of the energy. For our curves, hitting a comparable reconstruction quality required too many non-negligible coefficients, so the Fourier form was never clearly smaller than simpler encodings once you account for everything you still have to store.

Periods (run-length). We also tried encoding constant-price intervals: a start date, a span in days, and a single price, only opening a new tuple when the amount changes. That shines when prices sit flat for long stretches. In practice weekend vs weekday pricing breaks the calendar into short runs even where “dynamic pricing” is tame, so intervals stayed small and per-segment overhead dominated.
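To make the per-segment overhead concrete, here is a small run-length sketch over a synthetic curve. The encode_runs and decode_runs names, the tuple layout, and the prices (100 on most nights, 120 on Friday and Saturday nights) are invented for illustration and are not taken from our data.

```python
from datetime import date, timedelta
from typing import List, Sequence, Tuple

Run = Tuple[date, int, float]  # (start date, span in days, price)


def encode_runs(start: date, prices: Sequence[float]) -> List[Run]:
    """Collapse consecutive equal prices into (start, span, price) tuples."""
    runs: List[Run] = []
    for offset, price in enumerate(prices):
        if runs and runs[-1][2] == price:
            run_start, span, _ = runs[-1]
            runs[-1] = (run_start, span + 1, price)  # extend the open run
        else:
            runs.append((start + timedelta(days=offset), 1, price))
    return runs


def decode_runs(runs: Sequence[Run]) -> List[float]:
    """Expand runs back into one price per day."""
    return [price for _, span, price in runs for _ in range(span)]


# Four weeks of a tame weekday/weekend pattern, starting on a Monday.
start = date(2024, 1, 1)
curve = [120.0 if (start + timedelta(days=i)).weekday() in (4, 5) else 100.0
         for i in range(28)]
runs = encode_runs(start, curve)
```

In this synthetic case the encoder emits 9 runs for 28 days, so 27 stored fields replace 28 plain prices. Even a perfectly regular weekly pattern leaves almost nothing for run-length encoding to remove.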

Dense vectors (what we shipped). The format that won stores a compact key for the slice and a blob holding a contiguous array of float32 prices along the check-in axis, with a start_day (day-of-year) anchor so we know which calendar days the array covers. Related slices sort near each other at the Scylla/Cassandra level, so an SSTable sees long runs of nearly identical key material and correlated bytes inside the payloads—exactly what a block compressor exploits.
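A minimal sketch of what packing and reading such a blob can look like, using only the Python standard library. The little-endian byte order and the pack_prices, unpack_prices, and price_on names are assumptions for the example; the text above only fixes float32 values and a start_day anchor.

```python
import struct
from typing import List, Sequence

FLOAT32_SIZE = 4  # bytes per price


def pack_prices(prices: Sequence[float]) -> bytes:
    """Pack prices as a contiguous array of little-endian float32 values."""
    return struct.pack(f"<{len(prices)}f", *prices)


def unpack_prices(blob: bytes) -> List[float]:
    """Unpack the whole blob back into a list of floats."""
    count = len(blob) // FLOAT32_SIZE
    return list(struct.unpack(f"<{count}f", blob))


def price_on(blob: bytes, start_day: int, day_of_year: int) -> float:
    """Read one price by calendar day without unpacking the whole array.

    `start_day` is the day-of-year that the first element refers to.
    """
    index = day_of_year - start_day
    if not 0 <= index < len(blob) // FLOAT32_SIZE:
        raise IndexError("day is not covered by this blob")
    return struct.unpack_from("<f", blob, index * FLOAT32_SIZE)[0]
```

A curve of n prices costs exactly 4n bytes before compression, with no field names or separators, and a single day is one offset calculation away.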

The trade-off is intentional: we give up some ad-hoc queryability at the single-price-cell level in exchange for much better write throughput, compression locality, and low-overhead reads for model-serving paths that already consume forward curves as vectors.

The two biggest levers are still the logical layout and the chunk compressor. Partition boundaries matter: too fine and you pay repeated metadata tax; too coarse and each read touches more compressed data at once. Grouping by property and calendar year struck a workable balance for our access patterns and working set.

We put year in the partition key alongside property_id so one partition never spans multiple calendar years. Scrapes are naturally tied to “which year’s forward calendar we are filling in,” and most reads are scoped to a property and a recent year band, so hot data stays in a bounded set of partitions while completed years age into cold, mostly static slices that compaction can park in large SSTables.

The snippet below is illustrative: the point is the prices blob and how partitioning and primary-key clustering group related slices—not a complete production DDL.

CREATE TABLE prices.observations (
    property_id   bigint,
    year          smallint,
    room_id       bigint,
    los           smallint,
    nr_adults     smallint,
    updated_at    timestamp,
    start_day     smallint,
    prices        blob,
    PRIMARY KEY ((property_id, year),
                 room_id, los, nr_adults, updated_at)
) WITH CLUSTERING ORDER BY (room_id ASC, los ASC, nr_adults ASC, updated_at DESC)
  AND compression = {
    'sstable_compression': 'ZstdCompressor',
    'chunk_length_in_kb': 64,
    'compression_level': 6
}
  AND compaction = { 'class': 'TimeWindowCompactionStrategy',
                   'compaction_window_unit': 'DAYS',
                   'compaction_window_size': 7 };
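One consequence of putting year in the partition key is that a date-range read has to name every year it touches. A minimal sketch, assuming the range is expressed in check-in dates; partitions_for_range is a hypothetical helper and not part of any driver API:

```python
from datetime import date
from typing import List, Tuple


def partitions_for_range(property_id: int, first: date,
                         last: date) -> List[Tuple[int, int]]:
    """List the (property_id, year) partitions a date range touches."""
    if last < first:
        raise ValueError("last must not precede first")
    return [(property_id, year) for year in range(first.year, last.year + 1)]
```

A caller would issue one query per returned pair and merge the results client-side. Most reads fall inside a single year, so the fan-out is usually one partition.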

The final piece is compaction. Time-window compaction keeps old data in large, cold SSTables that don’t get rewritten, which is ideal: we want the compressor to see as much similar data at once as possible, and we don’t want to pay to re-compress data that will never change.

That said, this assumes the write pattern is mostly time-local. If late backfills or out-of-window updates are common, TWCS loses much of that benefit, and you should re-check the compaction strategy and window size against that reality.

Results and lessons learned

End-to-end shrink versus the scraper’s raw JSON is on the order of 100x in our estimate. We compute that estimate by sampling representative raw scrape payloads, counting their uncompressed JSON bytes (including repeated field names and envelope structure), and comparing with the retained dataset for the same property/date horizon in ScyllaDB.

For ScyllaDB-internal compression, we estimate the ratio of uncompressed logical table size to compressed on-disk size at about 20x with this schema and ZStd. While this is not a direct byte-for-byte replay measurement, it is a high-confidence operational estimate derived from ScyllaDB’s scylla_column_family_total_disk_space_before_compression metric versus the table’s on-disk SSTable footprint over the same window. Disk usage sits around 300 GB across the cluster for the full history we keep online, and reads for a single property-month return in tens of milliseconds.

Current operating context for these numbers:

  • three ScyllaDB nodes on modest cloud instances
  • sustained write-heavy ingestion with periodic scrape bursts
  • single property-month reads in the tens of milliseconds

The main lesson is not to treat compression as a knob you turn after the fact. Ratios track how dense the representation is before ZStd ever runs: locality on disk still matters (similar keys, correlated bytes in the prices blob), but it is only half the story. The other half is how fast you can turn stored bytes into something models and analysts can use. Even a perfect compressor does not fix a pipeline that moves fat JSON to clients, parses text, and rebuilds tabular structures in a dataframe before analysis—each step is CPU, memory, and bytes on the wire. Keeping the forward curve as vector-shaped binary you can unpack almost directly preserves both disk economy and throughput for anything downstream that reads this pricing data.

Where this design is a weaker fit:

  • if the workload needs ad-hoc predicates on individual daily prices, blobs are awkward
  • if updates are extremely sparse and mostly incremental, full-snapshot blobs over-write too much
  • if late data is frequent, the chosen time-window compaction assumptions need revisiting

What I would do differently now

  • Suppress redundant snapshots: Keep the same packed blob representation, but treat “unchanged” as hash(blob) matching the latest row, and store that hash alongside the blob so the read-before-write check stays small. Expect to read before every write, and accept rare duplicates under concurrency rather than reaching for heavyweight consistency. Separate concern: append a tiny coverage record on every scrape so “we looked” is still queryable when the heavy table stays quiet because nothing moved.
  • Re-open layout and database together: The blob-centric encoding was as much a product of fixed, always-save cadence as of the domain—when every slice gets a full snapshot on a schedule, you fight bytes and cell count first, and block compression plus partition locality earn their keep. If per-property frequencies and hash-gated writes make the observation stream genuinely sparse, a narrower, more SQL-queryable shape starts to look attractive again, but only if the backing database still sustains ingestion spikes and indexing for the worst properties, not just the median quiet case. I’d revisit Postgres/Timescale (or an OLAP path for heavy scans) on those terms, not only because raw storage got cheaper.
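The hash-gated write from the first bullet can be sketched as follows. This is an illustration of the idea rather than anything we run: InMemoryStore is a stand-in for the database, record_scrape is a hypothetical name, and SHA-256 is an arbitrary choice of hash.

```python
import hashlib
from typing import Dict, Hashable, List, Tuple


class InMemoryStore:
    """Stand-in for the database; not a real driver API."""

    def __init__(self) -> None:
        self.latest_digest: Dict[Hashable, bytes] = {}  # slice key -> digest of newest blob
        self.snapshots: List[Tuple[Hashable, int, bytes, bytes]] = []  # heavy table
        self.coverage: List[Tuple[Hashable, int]] = []  # tiny "we looked" records


def record_scrape(store: InMemoryStore, key: Hashable,
                  scraped_at: int, blob: bytes) -> bool:
    """Store a snapshot only if it differs from the latest one for `key`.

    Returns True when a new snapshot row was written. The coverage record
    is appended on every call, whether or not the blob changed.
    """
    store.coverage.append((key, scraped_at))
    digest = hashlib.sha256(blob).digest()
    # Read-before-write: only the small digest is compared, never the blob.
    if store.latest_digest.get(key) == digest:
        return False
    store.snapshots.append((key, scraped_at, blob, digest))
    store.latest_digest[key] = digest
    return True
```

Against a real cluster the read and the write are separate operations, so two concurrent scrapers can both pass the check and write the same blob. That is the rare duplicate the bullet proposes to accept.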