Mastering Real-Time Unique Counts at Massive Scale

Counting unique items, whether they’re viewers, users, or specific events, is a fundamental task in data analytics. Simple approaches, perhaps inspired by the efficient methods platforms like Reddit use, work well initially. However, when systems scale to handle trillions of interactions, these basic solutions run into significant performance hurdles.

Let’s explore the challenges faced when counting unique items at extreme scale and the optimization strategies required to maintain performance.

The Bottleneck of Sheer Volume: When Views Overwhelm

A common implementation for tracking events like post views involves storing them sorted by an identifier, such as post_id. Counters then filter by this ID. The primary scaling challenge arises directly from the number of view events associated with a single popular item (e.g., a post). Performance degradation occurs due to:

  • Scanning an excessive number of events per item (potentially billions of rows).
  • Handling concurrent queries targeting the same highly popular items.
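
For concreteness, here is a minimal sketch of the kind of raw table this layout implies, written in ClickHouse-style SQL. The table name post_views, the column names and types, and the engine settings are illustrative assumptions rather than a prescribed schema.

-- Hypothetical raw events table, ordered by post_id so per-post lookups only touch that post's rows
CREATE TABLE post_views
(
    `timestamp` DateTime,   -- when the view happened
    `post_id`   String,     -- the item being viewed
    `viewer_id` Int64       -- 64-bit viewer identifier
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(timestamp)
ORDER BY (post_id, timestamp);

-- Even with this sorting key, counting views for one popular post still scans every row for that post
SELECT count() AS total_views
FROM post_views
WHERE post_id = 'some_specific_post_id';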

Consider the data volume involved for posts with varying view counts:

  • 10 Million views: ~57MB (compressed)
  • 100 Million views: ~565MB (compressed)
  • 1 Billion views: ~5.5GB (compressed)

Even with efficient compression, indexing, and filtering by post_id, the time required to scan these views increases significantly:

  • 10 Million views: ~20 ms
  • 100 Million views: ~200-400 ms
  • 1 Billion views: ~2-4 seconds

Roughly, every tenfold increase in views per item can lead to a tenfold increase in query time. This is for a single query; concurrent requests for popular items will further extend these response times.

The Cardinality Crunch: When Exact Uniqueness Becomes Too Costly

While scanning billions of rows is problematic, the more significant bottleneck often arises when dealing with high cardinality – a large number of distinct viewers. Functions designed for perfect accuracy, like uniqExact (common in analytics databases), demand a steep price at scale.
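
For reference, here is what the exact-count query looks like against the assumed post_views table from above; it is this shape of query that runs into the costs described next:

-- Exact counting: builds an in-memory hash set of every distinct viewer_id,
-- so cost grows with the number of unique viewers, not just the rows scanned
SELECT
    post_id,
    uniqExact(viewer_id) as unique_viewers
FROM post_views
WHERE post_id = 'some_specific_post_id' -- Parameterized in practice
GROUP BY post_id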

Query time is impacted by more than just row scanning:

  1. Hash Set Insertions: Processing each unique value requires insertion into a hash set, scaling linearly with the number of unique items.
  2. Memory Allocation: The hash set grows, consuming memory. Performance degrades sharply when the set exceeds available CPU caches or spills into main memory (RAM), or worse, requires disk swapping.

As the count of unique viewers climbs, the hash set needs frequent resizing, causing CPU stalls. When the set outgrows the fast L3 cache and relies on slower RAM, performance noticeably drops. Memory pressure leading to disk swapping is catastrophic for speed. Furthermore, larger datasets increase the probability of hash collisions, adding another performance penalty.

Performance Zones in Practice

Observing real-world systems reveals distinct performance tiers based on unique viewer counts (assuming 64-bit viewer_ids; query times assume roughly a 10% uniqueness rate, and memory works out to roughly 16 bytes per unique value once hash-set overhead is added to the 8-byte ID):

  • 1 Million unique viewers: ~16MB, fits in the CPU L3 cache; queries in ~10-20ms (very fast)
  • 10 Million unique viewers: ~160MB, held in RAM; ~20-60ms (noticeable cache misses)
  • 100 Million unique viewers: ~1.6GB in RAM; ~2-5s (heavy memory access)
  • 1 Billion unique viewers: ~16GB in RAM plus potential swap; ~15-20s (system under strain)

This leads to three practical zones:

  1. The L3 Cache Zone (<1M uniques): Lightning-fast performance.
  2. The Memory Zone (1M-100M uniques): Acceptable, but gradually declining performance.
  3. The Danger Zone (>100M uniques): Performance plummets dramatically.

Even on servers with substantial RAM (e.g., 32GB), systems often hit a pain threshold around 500 million unique viewers per query. Queries become unacceptably slow (multiple seconds), Out-Of-Memory (OOM) errors become a risk, and infrastructure costs escalate. Concurrent queries exacerbate all these issues.
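
A practical way to find out which items are approaching the danger zone is to rank them by approximate unique-viewer counts. The sketch below assumes the same post_views table and uses uniq, ClickHouse's default approximate distinct-count function, so the survey itself stays cheap:

-- Rough triage: which posts carry the highest unique-viewer cardinality?
SELECT
    post_id,
    uniq(viewer_id) AS approx_unique_viewers
FROM post_views
GROUP BY post_id
ORDER BY approx_unique_viewers DESC
LIMIT 20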

Two Paths to Optimization

So, how can one efficiently count potentially billions of unique viewers per item within a dataset containing trillions of total views, without excessive cost or slow performance?

1. Approximate Counting with uniqCombined64

The most straightforward optimization is switching from exact counting functions to probabilistic ones like uniqCombined64.

SELECT
    post_id,
    uniqCombined64(viewer_id) as unique_viewers
FROM post_views
WHERE post_id = 'some_specific_post_id' -- Parameterized in practice
GROUP BY post_id

Functions like uniqCombined64 are often preferred over older methods like uniqHLL12. uniqCombined64 employs an adaptive strategy, using different counting mechanisms depending on the data’s cardinality:

  • Array Mode: For very small numbers of unique values, uses a simple array.
  • Hash Mode: As unique values increase, transitions to a memory-efficient sparse hash set.
  • HyperLogLog Mode: For very large cardinalities, switches to a HyperLogLog (HLL) algorithm, similar to what Reddit implemented in Redis for efficient view counting.

This adaptive approach yields higher accuracy at lower cardinalities compared to functions that switch to HLL approximation earlier.
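
If you want to measure the error on your own data before switching, a one-off side-by-side comparison on a post small enough for exact counting is a cheap sanity check (a sketch against the assumed post_views table):

-- Side-by-side accuracy check for a single, manageable post
SELECT
    uniqExact(viewer_id)      AS exact_uniques,
    uniqCombined64(viewer_id) AS approx_uniques,
    round(abs(toInt64(approx_uniques) - toInt64(exact_uniques)) / exact_uniques * 100, 2) AS error_pct
FROM post_views
WHERE post_id = 'some_specific_post_id' -- choose a post small enough for uniqExact to remain feasible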

Why uniqCombined64 Excels:

  • High Accuracy: Typically achieves accuracy around 99.2% (~0.8% error), sufficient for many analytics use cases.
  • Predictable Memory: Uses constant memory (often around 80KB per aggregation state), preventing runaway RAM consumption.
  • Scalability: Efficiently handles billions of unique values.
  • Speed: Significantly faster than exact methods (e.g., potentially 250ms vs. 10+ seconds for 1 billion uniques) by avoiding heavy hash set operations.

Many teams initially opt for perfect accuracy with uniqExact but later find that the high accuracy and performance stability of uniqCombined64 make it a better trade-off, preventing database OOM issues and ensuring predictable query times.

However, while approximation solves the memory and computation bottleneck of high cardinality, it doesn’t inherently solve the issue of scanning billions of raw rows if that remains a problem.

2. Pre-aggregation with Materialized Views

When exact counts are non-negotiable, or when query speed is paramount even with approximation, pre-aggregation using materialized views is the solution. This involves pre-calculating intermediate results.

Materialized View Schema (Example):
Stores daily unique viewer states per post.

// DESCRIPTION: Materialized daily unique viewers per post
// SCHEMA:
//  `date` Date,
//  `post_id` String,
//  `unique_viewers_state` AggregateFunction(uniqExact, Int64) // Stores the intermediate state
// ENGINE: AggregatingMergeTree
// ENGINE_PARTITION_KEY: toYYYYMM(date)
// ENGINE_SORTING_KEY: date, post_id

Materialized View Population Logic (Example):
Calculates the daily state from raw views.

-- DESCRIPTION: Pre-aggregates unique viewers per post daily using exact counting state
-- NODE daily_unique_viewers
SELECT
    toDate(timestamp) as date,
    post_id,
    uniqExactState(viewer_id) as unique_viewers_state -- Calculate the state
FROM post_views
GROUP BY date, post_id
-- TYPE: materialized
-- DATASOURCE: daily_post_viewers_mv (The materialized view table)
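
Outside of a managed pipeline, the same pre-aggregation can be expressed directly in ClickHouse-style DDL as an AggregatingMergeTree target table plus a materialized view that writes into it. This is a sketch that mirrors the schema above; the object names are carried over from the example and remain assumptions.

-- Target table holding one intermediate state per (date, post_id)
CREATE TABLE daily_post_viewers_mv
(
    `date` Date,
    `post_id` String,
    `unique_viewers_state` AggregateFunction(uniqExact, Int64)
)
ENGINE = AggregatingMergeTree
PARTITION BY toYYYYMM(date)
ORDER BY (date, post_id);

-- Materialized view that populates the target table as new rows land in post_views
CREATE MATERIALIZED VIEW daily_unique_viewers
TO daily_post_viewers_mv
AS
SELECT
    toDate(timestamp) AS date,
    post_id,
    uniqExactState(viewer_id) AS unique_viewers_state
FROM post_views
GROUP BY date, post_id;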

Querying the Materialized View (Example):
Merges the pre-aggregated states for the final count.

-- DESCRIPTION: Fast exact unique viewers using pre-aggregated data
-- NODE optimized_post_analytics
SELECT
    post_id,
    uniqExactMerge(unique_viewers_state) as unique_viewers -- Merge states for final count
FROM daily_post_viewers_mv
WHERE post_id = 'some_specific_post_id' -- Parameterized
  -- Optional date range filtering:
  -- AND date >= 'start_date'
  -- AND date <= 'end_date'
GROUP BY post_id
-- TYPE: endpoint

This approach offers:

  • Exact Counts: Maintains perfect accuracy.
  • Drastically Reduced Query Time: Queries operate on much smaller, pre-aggregated data.
  • Lower Query Memory Usage: Merging daily states requires less memory than processing all raw viewer IDs.
  • Real-time Updates: Materialized views can be updated continuously as new data arrives.
  • Trade-off: Less flexibility for arbitrary time ranges not aligned with the aggregation interval (e.g., querying a specific 3-hour window if aggregation is daily).

Combining Approaches for Maximum Scalability

For the most demanding scenarios, combining approximation and pre-aggregation offers the best of both worlds. A common hybrid strategy involves:

  1. Pre-aggregating unique viewer counts daily using an approximate function’s state (uniqCombined64State).
  2. Querying this materialized view for full days within the requested date range.
  3. Querying the raw data table (using uniqCombined64) only for the partial start and end days of the date range, if necessary.
  4. Merging the results from the materialized view and the raw data queries.

Materialized View Schema (Hybrid Example):
Uses the approximate function’s state.

// DESCRIPTION: Materialized daily unique viewers per post (approximate state)
// SCHEMA:
//  `date` Date,
//  `post_id` String,
//  `unique_viewers_state` AggregateFunction(uniqCombined64, Int64) // Approximate state
// ENGINE: AggregatingMergeTree
// ENGINE_PARTITION_KEY: toYYYYMM(date)
// ENGINE_SORTING_KEY: date, post_id

Materialized View Population Logic (Hybrid Example):
Uses uniqCombined64State.

-- DESCRIPTION: Materializes daily unique viewers per post using uniqCombined64 state
-- NODE daily_post_viewers_approx_mv_populator
SELECT
    toDate(timestamp) as date,
    post_id,
    uniqCombined64State(viewer_id) as unique_viewers_state
FROM post_views
GROUP BY date, post_id
-- TYPE: materialized
-- DATASOURCE: daily_post_viewers_approx_mv

Query Endpoint (Hybrid Example):
This logic is more complex, involving conditional querying based on the date range provided.

-- DESCRIPTION: API counting unique viewers, using MV for full days, raw data for partial days.

-- NODE full_days_query (Conditional: Runs if start_date or end_date provided)
-- Gets state from MV for complete days between start and end dates.
SELECT post_id, unique_viewers_state
FROM daily_post_viewers_approx_mv
WHERE post_id = 'some_id' AND date > toDate(startDate) AND date < toDate(endDate)

-- NODE start_day_query (Conditional: Runs if start_date provided)
-- Gets state from raw data for the partial start day.
SELECT post_id, uniqCombined64State(viewer_id) as unique_viewers_state
FROM post_views
WHERE post_id = 'some_id' AND toDate(timestamp) = toDate(startDate) AND timestamp >= startDate
GROUP BY post_id

-- NODE end_day_query (Conditional: Runs if end_date provided)
-- Gets state from raw data for the partial end day.
SELECT post_id, uniqCombined64State(viewer_id) as unique_viewers_state
FROM post_views
WHERE post_id = 'some_id' AND toDate(timestamp) = toDate(endDate) AND timestamp <= endDate
GROUP BY post_id

-- NODE final_aggregation (Endpoint Logic)
-- Merges states from full_days, start_day, end_day if dates provided.
-- If no dates provided, falls back to querying raw data for the post_id.
SELECT
    post_id,
    uniqCombined64Merge(unique_viewers_state) as unique_viewers
FROM (
    -- UNION ALL results from full_days_query, start_day_query, end_day_query
    -- based on whether start/end dates were provided
)
GROUP BY post_id
-- Fallback logic if no dates:
-- SELECT post_id, uniqCombined64(viewer_id) FROM post_views WHERE post_id = 'some_id' GROUP BY post_id

-- TYPE: endpoint

(Note: The above SQL snippets illustrate the logic; actual implementation depends on the specific platform’s syntax for conditional execution and data pipelines.)
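
As one concrete possibility, the merge step could be written as a single query with UNION ALL over the three sources. The sketch below reuses the assumed names from the nodes above, with startDate and endDate standing in for bound parameters:

SELECT
    post_id,
    uniqCombined64Merge(unique_viewers_state) AS unique_viewers
FROM
(
    -- Full days served from the materialized view
    SELECT post_id, unique_viewers_state
    FROM daily_post_viewers_approx_mv
    WHERE post_id = 'some_id'
      AND date > toDate(startDate) AND date < toDate(endDate)

    UNION ALL

    -- Partial start day computed from raw views
    SELECT post_id, uniqCombined64State(viewer_id) AS unique_viewers_state
    FROM post_views
    WHERE post_id = 'some_id'
      AND toDate(timestamp) = toDate(startDate) AND timestamp >= startDate
    GROUP BY post_id

    UNION ALL

    -- Partial end day computed from raw views
    SELECT post_id, uniqCombined64State(viewer_id) AS unique_viewers_state
    FROM post_views
    WHERE post_id = 'some_id'
      AND toDate(timestamp) = toDate(endDate) AND timestamp <= endDate
    GROUP BY post_id
)
GROUP BY post_id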

This hybrid approach provides fast queries for common date ranges via the materialized view while retaining flexibility for arbitrary ranges by querying raw data only for the boundaries, balancing performance and adaptability.

Choosing the Right Strategy: A Phased Approach

Selecting the best method depends on scale and requirements:

  1. Start Simple: Begin with an exact counting function (uniqExact). Monitor performance.
  2. Quick Fix (Memory Issues): If memory usage or exact count performance becomes problematic, switch to an approximate function (uniqCombined64).
  3. Scale Up (Query Speed): If query latency (even with approximation) is too high due to scanning volume, implement pre-aggregation using materialized views (with either exact or approximate states).
  4. Go Hybrid (Flexibility & Speed): For maximum performance across various query patterns, combine pre-aggregation (likely with approximate states) for common intervals with query-time calculation on raw data for boundaries or less common ranges.

Counting unique items at massive scale is achievable, but it requires moving beyond basic methods and embracing optimizations like approximation and pre-aggregation as data volumes and cardinality grow.


At Innovative Software Technology, we understand the complexities of building scalable, real-time analytics systems capable of handling massive data volumes, such as trillions of events. If you’re facing challenges with database performance, query latency, or infrastructure costs related to unique counting or other big data processing tasks, we can help. Our expertise lies in architecting robust solutions using advanced techniques like approximate counting algorithms (HyperLogLog) and materialized views for pre-aggregation. We optimize your data pipelines for extreme scale, ensuring you get fast, reliable insights from your real-time analytics while managing resources efficiently. Let Innovative Software Technology design the high-performance big data solution your business needs to thrive.
