Basic time-series operations
ClickHouse provides several ways to work with time-series data, letting you aggregate, group, and analyze data points across different time periods.
This section covers the fundamental operations: grouping data by time intervals, handling gaps in a series, and calculating changes between time periods.
These operations can be performed using standard SQL syntax combined with ClickHouse's built-in time functions.
We're going to explore ClickHouse time-series querying capabilities with the Wikistat (Wikipedia pageviews data) dataset:
CREATE TABLE wikistat
(
`time` DateTime,
`project` String,
`subproject` String,
`path` String,
`hits` UInt64
)
ENGINE = MergeTree
ORDER BY (time);
Let's populate this table with 1 billion records:
INSERT INTO wikistat
SELECT *
FROM s3('https://clickhouse-public-datasets.s3.amazonaws.com/wikistat/partitioned/wikistat*.native.zst')
LIMIT 1e9;
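Before running the queries that follow, it can be worth confirming that the load finished and checking which time range the data covers. This is a minimal sketch against the wikistat table defined above; the column aliases are illustrative, and the values returned depend on which files were read before the LIMIT was reached:

```sql
-- Row count and the time range covered by the loaded data.
SELECT
    count() AS total_rows,
    min(time) AS first_seen,
    max(time) AS last_seen
FROM wikistat;
```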
Aggregating by time bucket
The most common requirement is to aggregate data by time period, e.g. to get the total number of hits for each day:
SELECT
toDate(time) AS date,
sum(hits) AS hits
FROM wikistat
GROUP BY ALL
ORDER BY date ASC
LIMIT 5;
┌───────date─┬─────hits─┐
│ 2015-05-01 │ 25524369 │
│ 2015-05-02 │ 25608105 │
│ 2015-05-03 │ 28567101 │
│ 2015-05-04 │ 29229944 │
│ 2015-05-05 │ 29383573 │
└────────────┴──────────┘
We've used the toDate() function here, which converts the specified time to a date type. Alternatively, we can bucket by hour and filter on a specific date:
SELECT
toStartOfHour(time) AS hour,
sum(hits) AS hits
FROM wikistat
WHERE date(time) = '2015-07-01'
GROUP BY ALL
ORDER BY hour ASC
LIMIT 5;
┌────────────────hour─┬───hits─┐
│ 2015-07-01 00:00:00 │ 656676 │
│ 2015-07-01 01:00:00 │ 768837 │
│ 2015-07-01 02:00:00 │ 862311 │
│ 2015-07-01 03:00:00 │ 829261 │
│ 2015-07-01 04:00:00 │ 749365 │
└─────────────────────┴────────┘
The toStartOfHour() function used here converts the given time to the start of the hour.
You can also group by year, quarter, month, or day.
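For example, monthly totals could be computed with toStartOfMonth(), which returns the first day of the month. This is a sketch against the same wikistat table, shown without output; toStartOfYear(), toStartOfQuarter(), and toStartOfDay() can be swapped in the same way:

```sql
-- Total hits per calendar month.
SELECT
    toStartOfMonth(time) AS month,
    sum(hits) AS hits
FROM wikistat
GROUP BY ALL
ORDER BY month ASC;
```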
Custom grouping intervals
We can even group by arbitrary intervals, e.g., 5 minutes, using the toStartOfInterval() function.
Let's say we want to group by 4-hour intervals.
We can specify the grouping interval using the INTERVAL clause:
SELECT
toStartOfInterval(time, INTERVAL 4 HOUR) AS interval,
sum(hits) AS hits
FROM wikistat
WHERE date(time) = '2015-07-01'
GROUP BY ALL
ORDER BY interval ASC
LIMIT 6;
Or we can use the toIntervalHour() function:
SELECT
toStartOfInterval(time, toIntervalHour(4)) AS interval,
sum(hits) AS hits
FROM wikistat
WHERE date(time) = '2015-07-01'
GROUP BY ALL
ORDER BY interval ASC
LIMIT 6;
Either way, we get the following results:
┌────────────interval─┬────hits─┐
│ 2015-07-01 00:00:00 │ 3117085 │
│ 2015-07-01 04:00:00 │ 2928396 │
│ 2015-07-01 08:00:00 │ 2679775 │
│ 2015-07-01 12:00:00 │ 2461324 │
│ 2015-07-01 16:00:00 │ 2823199 │
│ 2015-07-01 20:00:00 │ 2984758 │
└─────────────────────┴─────────┘
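To see how toStartOfInterval() rounds a single timestamp, it can be applied to a literal value. The timestamp below is an arbitrary value chosen for illustration, not taken from the dataset:

```sql
-- Round one timestamp down to the start of its 5-minute and 4-hour buckets.
SELECT
    toDateTime('2015-07-01 13:47:21') AS ts,
    toStartOfInterval(ts, INTERVAL 5 MINUTE) AS five_min,  -- 2015-07-01 13:45:00
    toStartOfInterval(ts, INTERVAL 4 HOUR) AS four_hour;   -- 2015-07-01 12:00:00
```

The 4-hour bucket starts at 12:00 because, as in the results above, the buckets are aligned to midnight (00:00, 04:00, 08:00, and so on).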
Filling empty groups
Time-series data is often sparse, with no rows at all for some intervals, so those buckets are simply absent from the result. Take the following example, where we group data by 1-hour intervals. The output skips the hours that have no data:
SELECT
toStartOfHour(time) AS hour,
sum(hits)
FROM wikistat
WHERE (project = 'ast') AND (subproject = 'm') AND (date(time) = '2015-07-01')
GROUP BY ALL
ORDER BY hour ASC;
┌────────────────hour─┬─sum(hits)─┐
│ 2015-07-01 00:00:00 │         3 │ <- missing values
│ 2015-07-01 02:00:00 │         1 │ <- missing values
│ 2015-07-01 04:00:00 │         1 │
│ 2015-07-01 05:00:00 │         2 │
│ 2015-07-01 06:00:00 │         1 │
│ 2015-07-01 07:00:00 │         1 │
│ 2015-07-01 08:00:00 │         3 │
│ 2015-07-01 09:00:00 │         2 │ <- missing values
│ 2015-07-01 12:00:00 │         2 │
│ 2015-07-01 13:00:00 │         4 │
│ 2015-07-01 14:00:00 │         2 │
│ 2015-07-01 15:00:00 │         2 │
│ 2015-07-01 16:00:00 │         2 │
│ 2015-07-01 17:00:00 │         1 │
│ 2015-07-01 18:00:00 │         5 │
│ 2015-07-01 19:00:00 │         5 │
│ 2015-07-01 20:00:00 │         4 │
│ 2015-07-01 21:00:00 │         4 │
│ 2015-07-01 22:00:00 │         2 │
│ 2015-07-01 23:00:00 │         2 │
└─────────────────────┴───────────┘
ClickHouse provides the WITH FILL modifier to address this. It fills in all the empty hours with zeros, so we can better understand the distribution over time:
SELECT
toStartOfHour(time) AS hour,
sum(hits)
FROM wikistat
WHERE (project = 'ast') AND (subproject = 'm') AND (date(time) = '2015-07-01')
GROUP BY ALL
ORDER BY hour ASC WITH FILL STEP toIntervalHour(1);
┌────────────────hour─┬─sum(hits)─┐
│ 2015-07-01 00:00:00 │         3 │
│ 2015-07-01 01:00:00 │         0 │ <- new value
│ 2015-07-01 02:00:00 │         1 │
│ 2015-07-01 03:00:00 │         0 │ <- new value
│ 2015-07-01 04:00:00 │         1 │
│ 2015-07-01 05:00:00 │         2 │
│ 2015-07-01 06:00:00 │         1 │
│ 2015-07-01 07:00:00 │         1 │
│ 2015-07-01 08:00:00 │         3 │
│ 2015-07-01 09:00:00 │         2 │
│ 2015-07-01 10:00:00 │         0 │ <- new value
│ 2015-07-01 11:00:00 │         0 │ <- new value
│ 2015-07-01 12:00:00 │         2 │
│ 2015-07-01 13:00:00 │         4 │
│ 2015-07-01 14:00:00 │         2 │
│ 2015-07-01 15:00:00 │         2 │
│ 2015-07-01 16:00:00 │         2 │
│ 2015-07-01 17:00:00 │         1 │
│ 2015-07-01 18:00:00 │         5 │
│ 2015-07-01 19:00:00 │         5 │
│ 2015-07-01 20:00:00 │         4 │
│ 2015-07-01 21:00:00 │         4 │
│ 2015-07-01 22:00:00 │         2 │
│ 2015-07-01 23:00:00 │         2 │
└─────────────────────┴───────────┘
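On its own, WITH FILL only fills gaps between the first and last rows returned, so missing hours at the very start or end of the day would stay missing. The sketch below pins the range explicitly using the FROM and TO options of WITH FILL (TO is exclusive). The boundary values are assumptions chosen here to cover the whole of 2015-07-01:

```sql
SELECT
    toStartOfHour(time) AS hour,
    sum(hits)
FROM wikistat
WHERE (project = 'ast') AND (subproject = 'm') AND (date(time) = '2015-07-01')
GROUP BY ALL
-- Fill every hour from midnight up to, but not including, the next midnight.
ORDER BY hour ASC WITH FILL
    FROM toDateTime('2015-07-01 00:00:00')
    TO toDateTime('2015-07-02 00:00:00')
    STEP toIntervalHour(1);
```

For this particular query the result should match the output above, since both 00:00 and 23:00 already have data. The explicit bounds matter when the first or last bucket of the range is empty.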
Rolling time windows
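Sometimes we don't want buckets tied to the start of a calendar interval (like the start of a day or an hour), but buckets measured relative to a reference point.
Let's say we want to understand the total hits per day counted from a reference time of 6 pm, rather than per calendar date.
We can use the dateDiff() function (also available as date_diff()) to calculate the difference between a reference time and each record's time.
In this case, the day column represents the difference in days from the reference time (e.g., 1 day later, 2 days later, etc.). Note that dateDiff() counts the number of day boundaries crossed rather than elapsed 24-hour periods, which is why the totals below coincide with the calendar-day totals shown earlier: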
SELECT
dateDiff('day', toDateTime('2015-05-01 18:00:00'), time) AS day,
sum(hits)
FROM wikistat
GROUP BY ALL
ORDER BY day ASC
LIMIT 5;
┌─day─┬─sum(hits)─┐
│   0 │  25524369 │
│   1 │  25608105 │
│   2 │  28567101 │
│   3 │  29229944 │
│   4 │  29383573 │
└─────┴───────────┘
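Because dateDiff('day', ...) counts calendar-day boundaries, the buckets in the query above still change over at midnight. One way to get buckets that genuinely start at 6 pm is to divide the seconds elapsed since the reference time by 86400. This is a sketch under stated assumptions, shown without output: the WHERE clause is an addition that drops rows before the reference time so the integer division never sees a negative value, and the day alias now means "complete 24-hour periods since the reference time":

```sql
SELECT
    -- Whole 24-hour periods elapsed since 2015-05-01 18:00:00.
    intDiv(
        toUnixTimestamp(time) - toUnixTimestamp(toDateTime('2015-05-01 18:00:00')),
        86400
    ) AS day,
    sum(hits)
FROM wikistat
WHERE time >= toDateTime('2015-05-01 18:00:00')
GROUP BY ALL
ORDER BY day ASC
LIMIT 5;
```

Here day 0 covers 2015-05-01 18:00:00 up to 2015-05-02 17:59:59, day 1 covers the following 24 hours, and so on.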