Achieving 100x Compression on Scraped Pricing Data in ScyllaDB
At Smartpricing we continuously scrape competitor prices from major booking platforms so our pricing models can see the market the way a revenue manager would. The raw volume of that feed is unreasonable: tens of millions of rooms, sampled many times a day, across many occupancies, lengths of stay, and rate plans. A naive storage layout would balloon into tens of terabytes within months. This post walks through how we landed on a design that keeps the full history online in roughly 300 GB on a small ScyllaDB cluster, at a compression ratio of about 100x compared to the uncompressed JSON we started with.
The problem
The scraper fans out against a large catalog of properties. For each property we receive a matrix of prices keyed by check-in date, length of stay, and rate plan, repeated every few hours. In the early days we stored each scrape as one row per observation in PostgreSQL and quickly ran into the usual problems: vacuum pressure, bloated indexes, and queries that scanned months of history for a single property and took far longer than the downstream pricing job could tolerate.
We also needed the full history to stay queryable. Analysts and model training jobs regularly ask for “all observations for property X between dates Y and Z”, and the answer should come back in seconds rather than minutes. Throwing old data into cold storage was not acceptable.
Why ScyllaDB
The workload is write-heavy, partitioned naturally by property, and accessed as a time series. That maps cleanly onto a wide-column store. We evaluated Cassandra and ScyllaDB; ScyllaDB’s lower operational footprint and the shard-per-core architecture made it an easy choice for a small team. A three-node cluster on modest hardware comfortably absorbs the write rate with headroom to spare.
Just as importantly, ScyllaDB’s table-level compression settings let us trade a small amount of CPU on reads for large wins on disk and replication bandwidth. Given how compressible the data is, this trade was a foregone conclusion.
The compression approach
The two biggest levers were the data model and the chunk compressor. The model groups all observations for a single property and scrape window into one partition, with clustering columns that sort observations in a way that makes adjacent rows look extremely similar to each other. Consecutive rows differ by a day of stay, an occupancy, or a rate plan, so long runs of identical or near-identical bytes land next to each other on disk - which is exactly what block compressors love.
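The effect of clustering order is easy to reproduce with a toy experiment. The sketch below generates synthetic observation rows (all dimensions and prices invented for illustration) and compresses them once in clustering-key order and once shuffled; zlib from the standard library stands in for Zstd, since the locality argument applies to any LZ-family block compressor:

```python
import random
import zlib
from datetime import date, timedelta

# Toy reconstruction of the layout argument: rows sorted by their clustering
# key (check_in, length_of_stay, occupancy, rate_plan) put near-identical
# bytes next to each other, which block compressors exploit. zlib stands in
# for Zstd; the row format, dimensions, and prices are invented.
random.seed(42)
rows = []
for day in range(365):
    check_in = date(2024, 1, 1) + timedelta(days=day)
    for los in range(1, 8):
        for occ in range(1, 5):
            for plan in ("flex", "nonref"):
                price = 10000 + occ * 1500 + random.randint(-200, 200)
                rows.append(f"42,{check_in},{los},{occ},{plan},{price}\n")

clustered = "".join(rows).encode()          # on-disk order with the clustering key
shuffled_rows = rows[:]
random.shuffle(shuffled_rows)
shuffled = "".join(shuffled_rows).encode()  # same bytes, locality destroyed

size_clustered = len(zlib.compress(clustered, 6))
size_shuffled = len(zlib.compress(shuffled, 6))
print(size_clustered, size_shuffled)
```

The exact ratio depends on the generator, but the clustered ordering compresses substantially smaller than the shuffled one, with identical content in both.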
With that layout, ZStd at a moderate level produces compression ratios that would be implausible on unsorted JSON. The choice of partition granularity matters: too fine and we pay per-partition metadata overhead on every row; too coarse and we start eating into read latency because each read has to decompress a bigger SSTable chunk. We landed on one partition per property per scrape day, which keeps hot partitions small enough to sit comfortably in page cache.
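The chunk-size trade-off can be sketched the same way: SSTable chunks are compressed independently, so small chunks pay fixed per-chunk overhead and lose cross-row matches, while large chunks compress better but must be decompressed in full to serve a read that touches one row. Again zlib stands in for Zstd and the data is synthetic:

```python
import zlib

# Each chunk is compressed independently, mimicking how SSTable chunks work.
# Smaller chunks mean less decompression per point read but a worse overall
# ratio; zlib stands in for Zstd, and the rows below are synthetic.
data = "".join(
    f"42,2024-01-{d % 28 + 1:02d},{los},{occ},flex,{10000 + los * 100}\n"
    for d in range(365) for los in range(1, 8) for occ in range(1, 5)
).encode()

def chunked_size(payload: bytes, chunk_kb: int) -> int:
    chunk = chunk_kb * 1024
    return sum(len(zlib.compress(payload[i:i + chunk], 6))
               for i in range(0, len(payload), chunk))

size_4kb = chunked_size(data, 4)
size_64kb = chunked_size(data, 64)
print(size_4kb, size_64kb)
```

With 64 KB chunks the total compressed size comes out smaller than with 4 KB chunks, which is why the schema below picks 64 KB rather than something finer.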
CREATE TABLE prices.observations (
    property_id bigint,
    scrape_day date,
    check_in date,
    length_of_stay smallint,
    occupancy smallint,
    rate_plan text,
    scraped_at timestamp,
    price_cents int,
    currency text,
    PRIMARY KEY ((property_id, scrape_day),
                 check_in, length_of_stay, occupancy, rate_plan, scraped_at)
) WITH compression = {
    'sstable_compression': 'ZstdCompressor',
    'chunk_length_in_kb': 64,
    'compression_level': 6
}
AND compaction = {
    'class': 'TimeWindowCompactionStrategy',
    'compaction_window_unit': 'DAYS',
    'compaction_window_size': 7
};
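To get a feel for what "one partition per property per scrape day" implies, here is the back-of-the-envelope arithmetic. Every count below is an illustrative assumption, not our actual scraper configuration; the point is the shape of the estimate:

```python
# Rough sizing of one (property_id, scrape_day) partition. All dimension
# counts are illustrative assumptions, not the real scraper configuration.
check_in_horizon_days = 365  # how far ahead prices are scraped
lengths_of_stay = 7
occupancies = 4
rate_plans = 3
scrapes_per_day = 6          # observations per cell within one scrape_day

rows_per_partition = (check_in_horizon_days * lengths_of_stay
                      * occupancies * rate_plans * scrapes_per_day)
bytes_per_row = 40           # rough serialized size of one observation
raw_bytes = rows_per_partition * bytes_per_row
compressed_bytes = raw_bytes // 100  # at the ~100x ratio we measured

print(rows_per_partition, raw_bytes, compressed_bytes)
```

Under these assumed dimensions a partition holds a couple hundred thousand rows, a few megabytes uncompressed and on the order of tens of kilobytes on disk, which is the "small enough to sit comfortably in page cache" property the layout is aiming for.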
The final piece is compaction. Time-window compaction keeps old data in large, cold SSTables that don’t get rewritten, which is ideal: we want the compressor to see as much similar data at once as possible, and we don’t want to pay to re-compress data that will never change.
Results and lessons learned
End-to-end we see a compression ratio a bit over 100x versus the raw JSON the scraper emits, and around 20x versus a more compact binary encoding of the same rows. Disk usage sits around 300 GB across the cluster for the full history we keep online, and reads for a single property-month return in tens of milliseconds.
The main lesson, and the one I keep coming back to, is that compression ratios are a property of the data model, not the compressor. ZStd is excellent, but the reason the numbers are this good is that adjacent rows in a partition are almost identical. Any schema change that breaks that locality - for example, interleaving properties in a single partition - would cut the ratio by an order of magnitude overnight.