Data observability and alerting in plain SQL
Patterns and practical examples of SQL based alerts for data pipelines
I have seen quite a few resources on alerting patterns for traditional backend web applications, but haven’t seen many tailored to data pipelines. Even though the same principles apply, a large class of errors in data pipelines can only be detected by analyzing the resulting data.
A really simple way of monitoring your data pipelines is by using plain SQL queries, integrated with an alerting system. I started using this method years ago when I began working on stream/batch processing systems, and over time found it to be a surprisingly robust way of monitoring data.
In this blog post, I will share some basic SQL alerting patterns that I have found useful, and that hopefully you will find useful as well.
Data Staleness
Some datasets are expected to stay up to date within some window, and you will usually document this in an SLA for downstream consumers. For example, I worked on an advertising metrics system for many years that had an SLA of 24 hours for fresh data, and we needed to notify consumers if there were any delays.
A really simple way of alerting when we approach our SLA is to run a SQL query like this:
SELECT
    -- True if the most recent event is less than 12 hours old.
    DATEDIFF(HOUR, MAX(event_date_time_utc), CURRENT_TIMESTAMP()) < 12
FROM
    source_table;
It returns a boolean indicating whether the most recent event in the table occurred within the last 12 hours. If this query returns False, we alert. Note that even though our SLA was 24 hours, we alerted our internal team well before then to give ourselves time to push any fixes if need be. If we were certain we would not meet our SLA, we would notify our consumers. It was standard practice for us to have these types of alerts on all of our consumer-facing datasets, each with a different threshold depending on the SLA.
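A variant I like is returning the actual staleness rather than a boolean, since the number can be embedded directly in the alert message. A minimal sketch, reusing the same hypothetical source_table and the 12-hour internal threshold:
SELECT
    DATEDIFF(HOUR, MAX(event_date_time_utc), CURRENT_TIMESTAMP()) AS hours_stale
FROM
    source_table
HAVING
    -- Only return a row (and therefore fire the alert) once the most
    -- recent event is 12 or more hours old.
    DATEDIFF(HOUR, MAX(event_date_time_utc), CURRENT_TIMESTAMP()) >= 12;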
Data Duplicates
This is a common alert, especially when working with OLAP data stores that do not enforce primary keys. Redshift, Snowflake, Iceberg, and Delta Lake are all databases and table storage formats I have used that have no primitives to enforce the uniqueness of data. For this reason, we needed some way of detecting duplicates early, and a really easy way to do this was with a basic SQL query. Below I show three different methods I have used, depending on the use case.
Example 1: Existence of duplicates
SELECT
    COUNT(DISTINCT event_id) AS distinct_cnt,
    COUNT(*) AS cnt
FROM
    events
HAVING
    -- COUNT(*) counts every row, while COUNT(DISTINCT ...) collapses
    -- duplicates, so any difference means duplicate keys exist.
    COUNT(*) - COUNT(DISTINCT event_id) > 0
This query counts the total records in the table and compares that with the distinct count of the table's unique key. Note that the aggregates are repeated in the HAVING clause because many engines do not allow referencing column aliases there.
Example 2: Return the duplicate records
If you would also like to know which records are duplicates, the following query will return all event_ids that have duplicates.
SELECT
    event_id,
    COUNT(*) AS event_count
FROM
    events
GROUP BY
    event_id
HAVING
    COUNT(*) > 1
Note that if the unique key for your table is not a single column, you can simply group by your entire composite key instead. (The same method can be used for example 1; see the sketch after the next query.)
SELECT
    key1,
    key2,
    ...
    keyn,
    COUNT(*) AS event_count
FROM
    events
GROUP BY
    key1, key2, ..., keyn
HAVING
    COUNT(*) > 1
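And here is a sketch of example 1 adapted to a composite key. Since COUNT(DISTINCT key1, key2) is not supported by every engine, this version counts the distinct key combinations in a subquery (key1 and key2 are placeholders for your actual key columns):
SELECT
    total.cnt,
    dedup.distinct_cnt
FROM
    -- Total number of rows in the table.
    (SELECT COUNT(*) AS cnt FROM events) AS total
CROSS JOIN
    -- Number of distinct composite-key combinations.
    (SELECT COUNT(*) AS distinct_cnt
     FROM (SELECT DISTINCT key1, key2 FROM events) AS unique_keys) AS dedup
WHERE
    total.cnt > dedup.distinct_cnt;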
Example 3: Approximation for large datasets
For very large datasets, the above strategy may take a long time to compute. An alternative is to use APPROX_COUNT_DISTINCT and alert when the difference between COUNT(*) and APPROX_COUNT_DISTINCT is greater than the error threshold of your APPROX_COUNT_DISTINCT implementation. In the example below, we assume the error threshold is 2.5%. APPROX_COUNT_DISTINCT also takes an optional parameter in some engines that can reduce the error rate, at the cost of query performance. Depending on the SQL engine you use, this function may have a different name or implementation, but these engines typically use the HyperLogLog algorithm under the hood.
WITH counts AS (
    SELECT
        APPROX_COUNT_DISTINCT(event_id) AS distinct_count,
        COUNT(*) AS total_count
    FROM
        events
)
SELECT
    distinct_count,
    total_count
FROM
    counts
WHERE
    -- Only flag a difference larger than the 2.5% error threshold.
    ABS(total_count - distinct_count) > (total_count * 0.025);
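As an example of that optional parameter: in Spark SQL, APPROX_COUNT_DISTINCT accepts a maximum relative standard deviation as a second argument (other engines expose this differently, if at all, so check your engine's documentation):
-- Spark SQL: request a tighter 1% maximum relative standard deviation.
-- Lower values improve accuracy at the cost of query performance.
SELECT
    APPROX_COUNT_DISTINCT(event_id, 0.01) AS distinct_count
FROM
    events;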
Gaps in Continuous Data
Imagine you have a data stream where you expect events to occur fairly continuously, for example, ad impressions on a large platform like Google. Given the volume of searches per day, there should never be a 10-minute period with no recorded ad impressions.
WITH event_gaps AS (
    SELECT
        event_date_time_utc,
        LEAD(event_date_time_utc) OVER (ORDER BY event_date_time_utc) AS next_event_time,
        DATEDIFF(MINUTE, event_date_time_utc, LEAD(event_date_time_utc) OVER (ORDER BY event_date_time_utc)) AS gap_minutes
    FROM
        ad_impressions
)
SELECT
    event_date_time_utc AS start_time,
    next_event_time AS end_time
FROM
    event_gaps
WHERE
    gap_minutes > 10;
The query uses a CTE that maps each event to the timestamp of the following event using the LEAD window function, storing the difference for each row as gap_minutes. The main query then returns every pair of consecutive events that are more than 10 minutes apart.
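One caveat: the window function above sorts every row in the table, which can get expensive at ad-impression volumes. A cheaper sketch, assuming your engine supports DATE_TRUNC, collapses events to one row per minute first and then looks for gaps between the minute buckets:
WITH minutes AS (
    -- One row per minute in which at least one event occurred.
    SELECT DISTINCT
        DATE_TRUNC('MINUTE', event_date_time_utc) AS event_minute
    FROM
        ad_impressions
),
minute_gaps AS (
    SELECT
        event_minute,
        LEAD(event_minute) OVER (ORDER BY event_minute) AS next_minute
    FROM
        minutes
)
SELECT
    event_minute AS start_time,
    next_minute AS end_time
FROM
    minute_gaps
WHERE
    DATEDIFF(MINUTE, event_minute, next_minute) > 10;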
In my experience, I have found this type of alert really useful for detecting missing batches of data in an upstream service. For example, one application's upstream data originated from Kafka events that were ETLed to S3 for us to consume. A write from Kafka to S3 failed for a small window of time; this alert caught the issue, and we were able to recover the events later. In theory, this failure should have been caught by the upstream team, but 🤷, all that matters is that we caught it.
I have also found this alert to be useful in catching bugs I deployed in my own application. There was a time when I deployed a code change that passed along the wrong timestamp, making it seem like data was missing at specific hours. Instead of passing along the event_date_timestamp (which represented when the event occurred), I passed along the job_run timestamp, which was the same for every event processed in the same run 🤦. Luckily, this alert caught the issue and we fixed our data before we affected downstream consumers.
Data Anomalies
If you have a dataset that is relatively consistent over time, it may be useful to be alerted of any anomalies. This class of alerts is the trickiest to get right because the utility of your alert depends on the thresholds you set. A threshold that is too strict will make the alert noisy, while a threshold that is too lenient will cause you to miss real issues.
This post doesn’t go into the process of figuring out those thresholds, but below I show some sample queries you can use to alert on anomalies. For the examples below, assume I have a table revenue_events with columns event_date_time_utc and revenue.
Note: The following queries assume that your dataset follows a normal distribution.
Example: Abnormally High/Low Values
WITH stats AS (
    SELECT
        AVG(revenue) AS avg_amount,
        STDDEV(revenue) AS std_amount
    FROM
        revenue_events
)
SELECT
    event_date_time_utc,
    revenue
FROM
    revenue_events
CROSS JOIN
    -- Attach the single stats row to every record.
    stats
WHERE
    revenue > (avg_amount + (3 * std_amount))
    OR revenue < (avg_amount - (3 * std_amount));
The CTE computes the average revenue in the table and its standard deviation using the AVG and STDDEV functions. The main query then searches for records that are more than 3 standard deviations from the average. Note the CROSS JOIN, which attaches the single row of stats to every record so the bounds can be referenced in the WHERE clause.
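One thing to watch out for: the stats above are computed over the entire table, so very old data and the anomalies themselves are part of the baseline. A sketch that builds the baseline from the trailing 30 days and only scans the last day for anomalies (both window sizes are assumptions you would tune for your data):
WITH stats AS (
    SELECT
        AVG(revenue) AS avg_amount,
        STDDEV(revenue) AS std_amount
    FROM
        revenue_events
    WHERE
        -- Baseline: the trailing 30 days.
        event_date_time_utc >= DATEADD(DAY, -30, CURRENT_TIMESTAMP())
)
SELECT
    e.event_date_time_utc,
    e.revenue
FROM
    revenue_events AS e
CROSS JOIN
    stats
WHERE
    -- Only check the most recent day for anomalies.
    e.event_date_time_utc >= DATEADD(DAY, -1, CURRENT_TIMESTAMP())
    AND (e.revenue > avg_amount + (3 * std_amount)
         OR e.revenue < avg_amount - (3 * std_amount));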
Example: Value Outside Percentile Range
WITH percentile_bounds AS (
    -- Plain aggregates (no OVER clause) so the CTE yields a single row.
    SELECT
        PERCENTILE_APPROX(revenue, 0.05) AS lower_bound,
        PERCENTILE_APPROX(revenue, 0.95) AS upper_bound
    FROM
        revenue_events
)
SELECT
    event_date_time_utc,
    revenue
FROM
    revenue_events
CROSS JOIN
    percentile_bounds
WHERE
    revenue NOT BETWEEN lower_bound AND upper_bound;
This query returns records below the 5th percentile and above the 95th percentile of the overall distribution, which on its own may not be a particularly useful alert.
In practice, this query should likely be paired with time bounds so that you are only alerting on records that warrant investigation. For example, if you run a seasonal business, you might want to alert when you receive a revenue event below the 5th percentile during a historically high-volume period; a sketch of that idea follows.
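A minimal sketch, building the lower bound from historical events in the same calendar month and only checking the last day (the month-based window and the 7-day exclusion are illustrative assumptions, not a recipe):
WITH percentile_bounds AS (
    SELECT
        PERCENTILE_APPROX(revenue, 0.05) AS lower_bound
    FROM
        revenue_events
    WHERE
        -- Baseline: historical events from the same calendar month,
        -- excluding the most recent week so fresh data cannot skew it.
        MONTH(event_date_time_utc) = MONTH(CURRENT_TIMESTAMP())
        AND event_date_time_utc < DATEADD(DAY, -7, CURRENT_TIMESTAMP())
)
SELECT
    e.event_date_time_utc,
    e.revenue
FROM
    revenue_events AS e
CROSS JOIN
    percentile_bounds
WHERE
    -- Only alert on events from the last day.
    e.event_date_time_utc >= DATEADD(DAY, -1, CURRENT_TIMESTAMP())
    AND e.revenue < lower_bound;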
Example: Sudden Spikes or Drops
WITH revenue_diff AS (
    SELECT
        event_date_time_utc,
        revenue,
        LAG(revenue) OVER (ORDER BY event_date_time_utc) AS prev_revenue,
        revenue - LAG(revenue) OVER (ORDER BY event_date_time_utc) AS diff
    FROM
        revenue_events
)
SELECT
    event_date_time_utc,
    revenue,
    prev_revenue,
    diff
FROM
    revenue_diff
WHERE
    ABS(diff) > (
        SELECT
            AVG(ABS(diff)) + 3 * STDDEV(ABS(diff))
        FROM
            revenue_diff
    );
The CTE first calculates the difference between each event’s revenue and that of the event directly preceding it. The main query then flags any diff that is more than 3 standard deviations from the average diff.
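If revenue levels drift over time, an absolute-diff threshold can go stale; a relative version is often more robust. A sketch that flags any event moving more than 50% from the previous one (the 50% threshold is a placeholder to tune):
WITH revenue_diff AS (
    SELECT
        event_date_time_utc,
        revenue,
        LAG(revenue) OVER (ORDER BY event_date_time_utc) AS prev_revenue
    FROM
        revenue_events
)
SELECT
    event_date_time_utc,
    revenue,
    prev_revenue
FROM
    revenue_diff
WHERE
    -- prev_revenue > 0 also filters out the first row (NULL prev) and
    -- guards against division by zero.
    prev_revenue > 0
    AND ABS(revenue - prev_revenue) / prev_revenue > 0.5;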