Using SQL to Summarize A/B Experiments

I’ve run app, product, and marketing A/B experiments at tech companies, big and small. While the features and goals varied from experiment to experiment, the SQL I wrote for summarizing the results of those product experiments was the same every time. A design pattern!

👉 Build a user-level table, then calculate summary metrics off that.

This approach feels like second nature to me but I’m surprised how often I needed to show it to others getting started in experimentation. That’s because there isn’t really much out there that explains this approach. So I’m writing this guide!

Even if you use experimentation platform apps (especially those that sit on top of your data warehouse, like Eppo or GrowthBook), I find it vital to understand exactly how these apps pull from different source tables to compute summary metrics. Not only will you know better where their numbers come from, you also can compare your manually-calculated numbers vs. the ones from the app, giving you greater confidence that the app is correctly configured and is working as expected.

This guide assumes that:

Users are your randomization unit (i.e., you’re randomizing users into control and test variants or A/B variants). Some companies use other randomization units such as companies, sessions, or locations.
You have an experiment or feature flag assignment table. This table tells you for each experiment, which users were assigned into which variation at what time. Each row contains a user id, experiment name, feature flag evaluation, and assignment timestamp.
You have a basic dimensional table containing user attributes (i.e., dim_users). For each user id, there’s information about that user such as creation date, city name, signup marketing channel, date of first order, etc.
You’re using Snowflake. I’m using their SQL dialect here but the code can be used on other data warehouses with minor changes.

🧪 The Experiment

Suppose we’re an ice cream delivery service! Sundaes 🍨 bring in more revenue than ice cream cones 🍦, so we’re interested in increasing sundae orders. Our hypothesis was that our customers desired to order sundaes but needed help customizing their sweet concoctions. So our engineers built a sundae wizard, and that feature is what we tested in the experiment.

Our primary metric (i.e., success metric) was the number of sundae orders placed per user, and we predicted that the new sundae wizard would increase this primary metric. Other metrics we monitored were overall number of orders placed, average order value, and user conversion rate. We picked these guardrail metrics because we wanted to check that we weren’t harming the overall user experience and reducing our orders and/or revenue.

Our power analysis suggested we needed to run the experiment for two weeks to reach our target sample size. Our experiment ran from July 1 to July 15, and users were randomly assigned into a control group (no wizard) or a test group (wizard) using a 50/50 split.

🌟 The Process

When I get an experiment like this, my process is:

Build a persistent table that contains all measures needed for calculating summary metrics, aggregated to the user level
Calculate the summary metrics

Yes, that’s it! After you’ve calculated the summary metrics, you can go on to do significance testing, deep dives, and start writing up your report. All of these will depend on you having the summary metrics first.

1. User-Level Table

First, we need to build the user-level table. This table contains a row for each user exposed to the experiment, their attributes, and any relevant measures taken during the experiment period, aggregated to the user grain.

This powerful table serves many functions:

Easily calculate summary metrics. (the most important reason!)
Stable source of truth for who was in the experiment, which variant they saw, and their attributes and measures.
Starting place for deep-dives. You can slice metrics by specific user attributes, or explore behavior within a specific cohort.
Starting place for analyzing event data. You can join your event stream to this table to start counting events and building funnel analyses.

https://gist.github.com/foundinblank/259c2d0d6e66b513c8ecf0694877ee1f

Some key things to keep in mind when building user-level tables:

They should list all users you want to analyze, and none of the users you don’t want to analyze, such as spam users or users exposed to multiple variants.
Aggregated columns should contain actual values, not nulls. If a user placed zero orders during the experiment period, the cnt_orders value should be coalesced to 0 and not left as null. This allows you to calculate averages more safely, because nulls can result in returning the wrong denominator.
User attributes and order measures can change over time. You need your analysis to be reproducible — believe me, you’ll be asked to re-run it! ☝️Save it as a persistent table using thecreate table command. I find that a dedicated schema, experiments, is very useful for storing these tables.

2. Summary Metrics

Once you have a user-level table, you can roll up the metrics to the cohort level (control vs. test). You can further slice it by adding more attributes after cohort and updating the group by line (i.e., slice by marketing channel).

The output of the code below will be your summary metrics — the numbers by which your team will judge the success of the experiment.

https://gist.github.com/foundinblank/4b509275dedc21ad0bb10656b2dc1eed

Looks like the sundae wizard feature might have drove sundae orders! 🍨The control group had 0.78 sundae orders per user and the test group 0.83 sundae orders per user: an increase of 0.05 sundae orders per user. We’ll need to run statistical tests to determine if this was a significant increase or just noise.

A Note About Denominators

You’ll notice that I explicitly calculated averages by writing out both the numerator and denominator, instead of using avg() functions. This makes the code and metric definitions more transparent to others, and reduces the likelihood of (lazy!) errors and wrong metrics.

For example, if you define average sundae orders per user and average orders per user as avg(total_sundae_orders) and avg(total_orders) respectively, you might also, out of habit, define average order value as avg(total_order_value). This is wrong.

We’re working with a user-level table, not an order-level table. The count of users is what gets used as the denominator in avg(total_order_value). But the denominator for calculating average order value is the count of orders, not users! The right equation is total_order_value / cnt_orders.

Which brings me to one of my all-time favorite data science tweets:

https://twitter.com/beeonaposy/status/1235313388575694849

So, avoid avg() functions and write out calculated metrics in full.

Conclusion

If you’re lucky enough to work at a place that runs many experiments, you’ll find yourself writing the same SQL many times. The natural next step is how to automate that. I’ve used a range of different approaches for how to automate user-level tables and summary metrics over an unknown x number of experiments: brute force (processing all experiment/feature flag names), relying on dbt seeds and dbt macros, relying on an in-house experimentation platform, or using experimentation platform apps like Eppo or GrowthBook.

The next steps after calculating summary metrics are testing for significance and reporting lift & incrementality. I’ll write another post about how to do that.

Then comes the most important part. Through deep-dives and good ol’ data analysis work, we build a story about how and why we’ve gotten these metrics. The summary metrics and statistical significance are just starting points; the ✨ valuable insights ✨ lie in figuring out the how and why!