What is Sampling?
A Sampled report result is calculated from a random sample of the data rather than from the total result set. In Kubit you can Sample in two ways:
Sample based on Events performed
This means the random sample will be taken from the events with no regard for User, Session, Device ID etc.
Only available for customers using Snowflake. Other cloud warehouses either do not support event-level sampling or would incur severe performance penalties.
This is best used when you want to sample based on the number of times an Event has been performed.
Sample based on Subject (User ID, Session ID etc.)
These Subjects are custom to your Kubit environment
Sampling based on Subjects means that all events performed by each Sampled Subject (e.g. user) are returned in the results.
Available for all customers on all warehouses
This ensures the results are complete and Sampled users aren't missing events they performed at any step of the analysis (e.g. Funnel, Retention etc.).
Subject Sampling is typically used in Funnel, Retention, Path and Cohort reports.
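To make the difference between the two modes concrete, here is a minimal Python sketch. It is an illustration only, not Kubit's implementation: the function names, the toy event records and the use of Python's random module are all assumptions made for this example.

```python
import random
from collections import defaultdict

def sample_events(events, rate, seed=0):
    """Event-level sampling: keep each event row independently with probability `rate`."""
    rng = random.Random(seed)
    return [e for e in events if rng.random() < rate]

def sample_subjects(events, rate, subject_key="user_id", seed=0):
    """Subject-level sampling: choose a subset of subjects, then keep ALL of their events."""
    rng = random.Random(seed)
    subjects = sorted({e[subject_key] for e in events})
    kept = {s for s in subjects if rng.random() < rate}
    return [e for e in events if e[subject_key] in kept]

# Hypothetical toy data: 100 users, each performing the same 4 events.
events = [
    {"user_id": f"u{i}", "event": name}
    for i in range(100)
    for name in ("Login", "View Home", "Add to Cart", "Purchase")
]

by_event = sample_events(events, rate=0.5)      # a user may lose some of their events
by_subject = sample_subjects(events, rate=0.5)  # a user is either fully in or fully out

# Count how many events each sampled user kept under subject-level sampling.
events_per_user = defaultdict(int)
for e in by_subject:
    events_per_user[e["user_id"]] += 1
```

With subject-level sampling every user who appears in the sample still has all 4 of their events, which is why it suits multi-step reports. With event-level sampling a user can appear with only some of their events.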
Why should I Sample?
Most of the time Sampling isn't necessary. It is only required when a query has to process so much data that it cannot return results within our query timeout window (15 min).
If you know you're looking back over a very long period of time (e.g. years) and/or the number of events or users is going to be exceedingly large (e.g. 100M+), it's best practice to Sample so that the report can return results.
Sampling is also valuable when you don't need 100% accurate results and a Sampled result (e.g. at a 20% rate) is acceptable.
How to Sample in Kubit
In all reports you will find Sampling Rate controls when you click on "Additional Settings" at the top of the report builder.
Enable the Sampling Rate settings and select the appropriate rate. You can choose from the drop-down list or enter your own value.
Sampling in Query
In our Query report you are able to sample the results of your query by setting a Sample Rate.
All Measure functions use Event level Sampling, which is only available for Snowflake customers.
Your results will only represent the sampled % of the complete Events (100%). If your events are evenly distributed, the metric from the sample may be very close to the metric from the full data set.
Example:
Query Measure | 100% Results | 25% Sampled Results |
Count Events : Any Event | 100,000 Events performed | 25,000 Events performed |
Unique Users : Any Event | 3,500 Users | Approximately 875 Users (depending on user distribution across the events) |
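Because a Sampled result only represents the sampled % of the events, you may want to relate a sampled count back to the full data set yourself. The sketch below is a rough estimate under the assumption that events were sampled uniformly. The helper name estimate_full_count is hypothetical and not part of Kubit.

```python
def estimate_full_count(sampled_count: int, sample_rate: float) -> float:
    """Estimate the unsampled event count from a sampled count.

    Assumes every event had the same chance (`sample_rate`) of being kept.
    """
    if not 0 < sample_rate <= 1:
        raise ValueError("sample_rate must be in the range (0, 1]")
    return sampled_count / sample_rate

# Using the Count Events row from the table: 25,000 events at a 25% rate.
print(estimate_full_count(25_000, 0.25))  # 100000.0
```

This applies to additive measures such as event counts. Distinct counts such as Unique Users depend on how users are distributed across the events, as the table notes, so they are left out here.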
In Histogram or Impact Analysis mode in Query, where a subject is targeted, Subject level sampling is applied instead. In this case, the sample % of the unique subjects (e.g. users) is selected from the event data for analysis.
Sampling in Funnel, Retention and Path
In our Funnel, Retention and Path reports you are able to sample the results of your query by setting a Sample Rate based on the Subject you've selected.
Select your Subject; this is what will be Sampled.
Kubit uses consistent hashing of subject_id (e.g. user_id) to guarantee that the same users are analyzed across all steps (e.g. in Funnel, Retention, Path, and Query in Histogram/Impact mode).
Your results only represent the sampled % of the subjects from the complete event data (100%). If your events are evenly distributed, the metric from the sample may be very close to the metric from the full data set.
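The idea behind hash-based subject sampling can be sketched in a few lines of Python. This is only an illustration of the general technique: the hash function (SHA-256), the bucket count and the function name in_sample are assumptions for this example, and Kubit's actual hashing scheme is not documented here.

```python
import hashlib

def in_sample(subject_id: str, rate_percent: float) -> bool:
    """Decide deterministically whether a subject belongs to the sample.

    The subject_id is hashed into one of 10,000 buckets. A subject is in the
    sample when its bucket falls below the sampling rate, so the same ID
    always gets the same answer, no matter which step or query asks.
    """
    digest = hashlib.sha256(subject_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return bucket < rate_percent * 100

# Hypothetical user IDs, sampled at 50%.
users = [f"user_{i}" for i in range(10_000)]
sampled = [u for u in users if in_sample(u, 50)]
```

Because the decision depends only on the ID, a user selected for the first Funnel step is also selected for every later step. A side effect of this particular sketch is that a user included at a lower rate is also included at any higher rate.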
Example Funnel:
We will build a Funnel based on User ID and Sample at 50%
Step Event | 100% Results | 100% Conversion Rate | 50% Results | 50% Conversion Rate |
Login | 100,000 | 100% | ~50,000 | 100% |
View Home | 30,000 | 30% | ~15,000 | 30% |
Add to Cart | 10,000 | 33% | ~5,000 | 33% |
Purchase | 5,000 | 50% | ~2,500 | 50% |
You'll notice that the conversion rates remain the same because we are sampling whole users, so each Sampled user keeps all of their Step Events.
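The arithmetic behind the Funnel example can be checked with a short Python sketch. The helper step_conversion is hypothetical, and the 50% counts are taken as exact halves for illustration, whereas real sampled counts are approximate.

```python
def step_conversion(counts):
    """Return the step-to-step conversion rate for each Funnel step.

    The first step is 100% by definition. Every later step is divided by
    the count of the step before it.
    """
    rates = [1.0]
    for previous, current in zip(counts, counts[1:]):
        rates.append(current / previous)
    return rates

# Login, View Home, Add to Cart, Purchase
full_counts = [100_000, 30_000, 10_000, 5_000]
half_counts = [50_000, 15_000, 5_000, 2_500]

full_rates = [round(r * 100) for r in step_conversion(full_counts)]
half_rates = [round(r * 100) for r in step_conversion(half_counts)]
print(full_rates)  # [100, 30, 33, 50]
print(half_rates)  # [100, 30, 33, 50]
```

Halving every step count leaves each ratio unchanged, which is what the table shows.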
Sampling in Cohorts
Building a Cohort in the Segment, Cohort Filter or Cohort Builder features requires a subject, so Kubit uses the same method of consistent hashing of subject_id (e.g. user_id).
Segment
When a segment is referenced in a report, a sampling rate can be specified to reduce the number of subjects in the segment. This may significantly reduce query time.
Inspect
When inspecting a cohort, a sampling rate can also be applied to reduce the volume of data returned.
Custom Sampling Strategy
The default sampling mechanism is a universal solution that relies entirely on the data warehouse's random functions and requires no data model changes. For customers who want full control and/or the fastest performance, Kubit offers flexible custom sampling strategies that can be tailored to their specific use cases:
Sample by a specific property. e.g. user_id hash or substring.
Best performance can be achieved when this property is part of the cluster key (index)
Control sampling using a separate dimension table. e.g. assign certain users into a holdout group who will always be included in the sampling.
Apply different sampling strategies for each schema or data model
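As a rough illustration of the first two strategies, the Python sketch below samples by a substring of user_id and always includes a holdout group. Everything in it is hypothetical: the function name, the assumption that user IDs end in an evenly distributed digit, and the holdout set standing in for a dimension table. An actual custom strategy would be configured with Kubit against your own data model.

```python
def in_custom_sample(user_id, holdout, keep_suffixes=frozenset("01")):
    """Decide whether a user is included under a custom sampling rule.

    - Users in the holdout group are always included.
    - Other users are included when the last character of their ID is in
      `keep_suffixes`. With IDs ending in evenly distributed digits 0-9,
      keeping 2 of the 10 digits gives roughly a 20% sample.
    """
    if user_id in holdout:
        return True
    return user_id[-1:] in keep_suffixes

# Hypothetical holdout group, e.g. loaded from a separate dimension table.
holdout_users = {"user_1239"}

print(in_custom_sample("user_1230", holdout_users))  # True: ID ends in 0
print(in_custom_sample("user_1235", holdout_users))  # False: ID ends in 5
print(in_custom_sample("user_1239", holdout_users))  # True: in the holdout group
```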
Please reach out to your CSM to learn more about Custom Sampling.