Significance Report

Understand the statistical impact of experiments or segments.


What is Significance?

Kubit's Significance Report offers statistical insights to verify whether a finding in a Measure, Query, Funnel, or Retention report is significant ("whether a result is likely due to chance or to some factor of interest").

We don’t support Significance for other Reports like Path, Data Table or Cohort Compare.

It is a general-purpose tool that doesn’t necessarily require conducting formal A/B tests. You can use this feature in a strict way with your experiment data (A/B test assignment), or compare between any two groups (breakdowns or segments) as if they had different treatments.

The major steps in building a Significance Report are:

  1. Select the Mode you want to use in your analysis

  2. Identify the Control and Variant(s) you are going to analyze

  3. Select the additional parameters of your analysis, such as a one- or two-tailed test, p-value, and confidence level

  4. Select the Metric whose lift you want to analyze between your Control and Variant(s).

    1. You can select Saved Measures, Query, Funnel, and Retention reports.

  5. Apply any filters if needed and set your date range

  6. Execute!

Once executed you will see all the relevant statistical information.

Significance Modes Summarized

In Kubit's Significance Report, you're able to analyze experiment-type data in three ways.

  1. Experiment (Optional): Analyze the significance of a metric based on Experiment IDs and Variant IDs currently being shared with Kubit.

    1. This Mode is only available for Customers that have shared specific Experiment type data with Kubit. If you do not have this available, don't worry! You are still able to use the other 2 Modes.

  2. Breakdown: Analyze the significance of a metric based on a field and the values within it.

    1. This Mode is best if you have a field that denotes an experiment or different experience that you typically break down by.

    2. Note that the breakdown value must be present in the resulting measure. More about that below.

  3. Segments: Analyze the significance of a metric based on cohorts of users.

    1. This mode is best if you assign users to experiments or different experiences with a single "assignment" event.

    2. Create a Cohort of users who saw that assignment event with specific IDs to segment your control and variant groups.

Experiment Mode

Note that this Mode is only available for customers with the following fields:

  • Experiment ID or Name

  • AND a Variant ID or Name

  • Values are mapped to the User and not a specific event (i.e. an assignment or audience event)

    • If your experiment data is mapped to a specific event, we recommend you use the Segment Mode described below.

In Experiment Mode you will build your analysis by selecting the relevant Experiment ID and related Experiment Variants to isolate the correct audiences.

Kubit will map these fields according to your data and this Mode will be visible once mapped and enabled.

Breakdown Mode

When you're analyzing the lift or impact of metrics and aren't capturing typical experiment information, you can still measure lift from other values within your dataset. Some examples include User Type, Platform, Subscription Type, and Country.

Within Significance, you are able to build a report based on the breakdown of a Field value.

If your breakdown value has more than 2 groups you're able to add additional Variants with the "+ Add Variant" option.

Segment Mode

Segment Mode is similar to Breakdown Mode; however, instead of a Field, you segment users based on their presence in a Saved Cohort.

First, you'll want to build and save the Cohorts you want to compare using the Cohort options (read more in the Cohort documentation).

Cohorts that have not been Saved will not appear in the Segment Mode dropdown.

Once the Cohorts have been built, you will select them as the Control and Variant(s) Segments.

Other Selection Options

No matter the Mode you decide to use, there are 3 options available to customize the parameters of your analysis.

Hypothesis Test

One-Tail Test

  • Looks for an effect in one direction (e.g., better or worse).

Two-Tail Test

  • Looks for an effect in both directions (e.g., different, whether better or worse).

Use a one-tail test for a specific direction and a two-tail test for any difference.

P-Value

Description

  • The p-value tells us how likely it is to see our data, or something more extreme, if the null hypothesis is true.

  • Range: It goes from 0 to 1.

How to Interpret P-Value

If the p-value is less than 0.05, it means our result is statistically significant, and we should consider rejecting the null hypothesis.
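To make the one-/two-tailed distinction and the p-value threshold concrete, here is a minimal sketch using SciPy's Welch's t-test on made-up per-user metric values (the numbers and variable names are illustrative, not Kubit's internal implementation):

```python
# Illustrative only: compare a hypothetical control and variant with Welch's t-test.
import numpy as np
from scipy import stats

control = np.array([2, 3, 1, 4, 2, 3, 2, 5, 1, 3], dtype=float)  # e.g. checkouts per user
variant = np.array([3, 4, 2, 5, 3, 4, 3, 6, 2, 4], dtype=float)

# Two-tailed: "is the variant different from the control, in either direction?"
t_two, p_two = stats.ttest_ind(variant, control, equal_var=False, alternative="two-sided")

# One-tailed: "is the variant greater than the control?"
t_one, p_one = stats.ttest_ind(variant, control, equal_var=False, alternative="greater")

print(f"two-tailed p = {p_two:.4f}, one-tailed p = {p_one:.4f}")
```

If either p-value falls below 0.05, the corresponding test would be considered statistically significant at that threshold.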

Confidence Level

Description

  • The confidence level indicates the likelihood that observed differences between versions A and B are real, not random.

  • Range: Typically given as a percentage (e.g., 95% confidence level).

How to Interpret Confidence Level

A 95% confidence level means we are 95% confident that the true difference between versions A and B lies within the calculated interval.
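As an illustration of how a confidence level translates into an interval, the sketch below computes a 95% confidence interval for the difference between variant and control means using a normal approximation (a simplified example; Kubit's exact calculation may differ):

```python
# Illustrative only: 95% confidence interval for the difference in group means.
import numpy as np
from scipy import stats

control = np.array([2, 3, 1, 4, 2, 3, 2, 5, 1, 3], dtype=float)
variant = np.array([3, 4, 2, 5, 3, 4, 3, 6, 2, 4], dtype=float)

diff = variant.mean() - control.mean()
se = np.sqrt(variant.var(ddof=1) / variant.size + control.var(ddof=1) / control.size)
z = stats.norm.ppf(0.975)  # ~1.96 for a 95% confidence level

print(f"95% CI for the difference: [{diff - z * se:.2f}, {diff + z * se:.2f}]")
```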

Metric Selection

Once you've established the variants to be analyzed against one another, select the Metric you are interested in seeing the impact on. For example, if you are running an A/B test on an experience intended to improve checkout, the "Check Out" event count would be the metric you'd select.

A Metric should be relevant to the analysis and/or experiment you've selected and be something that enough users performed to yield a significant result. If you select a metric that is too unique or that too few users perform, there may not be enough data to produce a meaningful result.

Metric Options

  • Measure: Select from a list of Saved Measures to compare the results of each significance group.

  • Query: Select from a list of all Queries (that meet the criteria below) to compare the resulting Measure.

    • The Query cannot be built using Histogram or Impact Analysis modes.

    • The Query cannot use a Rolling Window.

  • Funnel: Select from a list of all Funnels (must be in Conversion Mode) to compare the resulting Conversion Rates.

  • Retention: Select from a list of all Retention curves (must be in Retention Mode) to compare the resulting Retention Rates.

Things to Consider

  • Kubit will overwrite dates in the original query and replace them with the dates selected in the Significance Report you've built. This ensures the metric results are in line with the significance date range you are interested in.

  • Any Filters in any part of the source Metric will be applied to the Significance Analysis.

  • If the Experiment/Breakdown/Segment you've selected has too few users you will see an error noting this and results will not be returned.

Filter Your Analysis

You will be able to filter based on a Global Filter or Cohort Filter as you do in all other Kubit reports. These filters are applied before any statistical analysis is performed.

Selecting Your Date Range

Once you've made all your inputs, you'll select the date range that corresponds with the duration of the experiment you want to analyze.

As mentioned above, Kubit will overwrite the dates from a Metric sourced Query/Funnel/Retention report to those of the significance date range. Don't worry, we don't change the underlying Metric report logic.

All results will be displayed as "All Time" and represent an aggregation over the entire duration of the date range.

Interpreting Your Results

Kubit will display results in the following way, and hovering over the Variants displayed will show more detailed metrics outlined in the Glossary of Terms Used section of this article.

You will be able to add these results to a Dashboard and/or Workspace.


Glossary of Terms Used

Hypothesis Test

One-Tail Test: Looks for an effect in one direction (e.g., better or worse).
Two-Tail Test: Looks for an effect in both directions (e.g., different, whether better or worse).

Use a one-tail test for a specific direction and a two-tail test for any difference.

P-Value

Description: The p-value tells us how likely it is to see our data, or something more extreme, if the null hypothesis is true.

Range: It goes from 0 to 1.

Interpretation: If the p-value is less than 0.05, it means our result is statistically significant, and we should consider rejecting the null hypothesis.

Confidence Level

Description: The confidence level indicates the likelihood that observed differences between versions A and B are real, not random.

Range: Typically given as a percentage (e.g., 95% confidence level).

Interpretation: A 95% confidence level means we are 95% confident that the true difference between versions A and B lies within the calculated interval.

Lift

Description: Lift, or relative performance, quantifies how much better one option performs compared to another.

Range: 0 and above; values above 1 indicate improvement, values below 1 indicate worse performance.

Interpretation: Higher lift values show greater improvement of the new option over the baseline.
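As a hedged illustration of the definition above (assuming lift is the variant's mean relative to the control's mean; the numbers are made up):

```python
# Illustrative only: lift as the ratio of variant performance to control performance.
control_mean = 0.20  # e.g. 20% of control users checked out
variant_mean = 0.23  # e.g. 23% of variant users checked out

lift = variant_mean / control_mean  # 1.15 -> above 1 means improvement
print(f"lift = {lift:.2f} ({(lift - 1) * 100:.0f}% relative improvement)")
```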

Lift Confidence Interval

Description: Lift CI shows the potential range of change around a measured value.

Range: It indicates how much actual performance might vary from reported improvement or decline.

Interpretation: Lift CI helps gauge uncertainty in measured outcomes or impacts, providing a statistical boundary for expected changes.
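One common way to obtain such an interval is bootstrapping; the sketch below resamples simulated conversion flags to estimate a 95% interval around the lift (illustrative only, not necessarily how Kubit computes it):

```python
# Illustrative only: bootstrap a 95% confidence interval around the lift.
import numpy as np

rng = np.random.default_rng(0)
control = rng.binomial(1, 0.20, size=5000)  # simulated conversion flags (1 = converted)
variant = rng.binomial(1, 0.23, size=5000)

lifts = []
for _ in range(2000):
    c = rng.choice(control, size=control.size, replace=True)
    v = rng.choice(variant, size=variant.size, replace=True)
    lifts.append(v.mean() / c.mean())

low, high = np.percentile(lifts, [2.5, 97.5])
print(f"lift 95% CI: [{low:.3f}, {high:.3f}]")
```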

Stat Sig

Description: The p-value tells us how likely it is to see our data, or something more extreme, if the null hypothesis is true.

Range: It goes from 0 to 1.

Interpretation: If the p-value is less than 0.05, it means our result is statistically significant, and we should consider rejecting the null hypothesis.

Sample Size

Description: Sample size is the number of participants or observations in each group. It affects result reliability.

Range: Depends on the study; usually, bigger is better.

Interpretation: Larger sample sizes make results more reliable and differences easier to detect. Smaller sizes may be less reliable and harder to show differences.

Mean

Description: The mean is the average outcome of a measurement (like clicks or purchases) for each group tested.

Range: Varies depending on the measurement and group characteristics.

Interpretation: The mean helps compare average performance between groups, revealing differences or similarities in measured outcomes.

Std Deviation

Definition: Standard deviation measures how spread out the numbers in a set are. It shows whether the numbers are close to the average (mean) or scattered far from it.

Range: Can vary from a small number (close to zero) to a larger number.

Interpretation: A small standard deviation means the numbers are close to the average. A large standard deviation means the numbers are spread out over a wider range.

Delta

Description: Delta percentage, or absolute performance, shows the exact difference in performance between two options. It focuses only on the actual change, not on how it compares to other factors.

Range: It can be positive or negative, depending on whether there is an improvement or a decline.

Interpretation:
Positive Delta: Performance has improved.
Negative Delta: Performance has declined.
Zero Delta: No change in performance.
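A short worked example of how delta differs from lift, using the same made-up conversion rates as earlier:

```python
# Illustrative only: delta is the absolute difference, lift is the relative ratio.
control_mean = 0.20
variant_mean = 0.23

delta = variant_mean - control_mean  # +0.03 -> positive delta means improvement
lift = variant_mean / control_mean   # 1.15 -> the same change expressed relatively
print(f"delta = {delta:+.2f}, lift = {lift:.2f}")
```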

Test Score for Query, Retention and Measures

Test Score (Welch's t-test)

Description: Measures the difference between two group means relative to the spread or variability of their scores.

Range: Any real number.

Interpretation: A higher absolute value indicates a greater difference between groups.
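For readers who want to see what this score corresponds to, here is a small sketch that computes Welch's t statistic both by hand and with SciPy (illustrative data; SciPy's equal_var=False option selects Welch's test):

```python
# Illustrative only: Welch's t statistic = difference in means / combined standard error.
import numpy as np
from scipy import stats

control = np.array([2, 3, 1, 4, 2, 3, 2, 5, 1, 3], dtype=float)
variant = np.array([3, 4, 2, 5, 3, 4, 3, 6, 2, 4], dtype=float)

se = np.sqrt(variant.var(ddof=1) / variant.size + control.var(ddof=1) / control.size)
t_manual = (variant.mean() - control.mean()) / se

t_scipy, _ = stats.ttest_ind(variant, control, equal_var=False)
print(f"Welch's t: manual = {t_manual:.3f}, scipy = {t_scipy:.3f}")
```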

Test Score for Funnel

Test Score (Chi-Square Statistic)

Description: Measures the strength of association between two categorical variables.

Range: 0 to Infinity.

Interpretation: Larger values indicate a greater discrepancy between observed and expected values, suggesting a stronger association between variables.
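As a hedged sketch of the kind of test this score comes from, the example below runs a chi-square test on made-up funnel conversion counts (converted vs. not converted per group):

```python
# Illustrative only: chi-square test on a 2x2 table of funnel conversions.
import numpy as np
from scipy.stats import chi2_contingency

# Rows: control, variant; columns: converted, did not convert.
table = np.array([
    [200, 800],  # control: 20% conversion
    [230, 770],  # variant: 23% conversion
])

chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi-square = {chi2:.2f}, p = {p_value:.4f}")
```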

P-Value

Description: The p-value tells us how likely it is to see our data, or something more extreme, if the null hypothesis is true.

Range: It goes from 0 to 1.

Interpretation: If the p-value is less than 0.05, it means our result is statistically significant, and we should consider rejecting the null hypothesis.

Power (# of exp)

Power

Description: Statistical power tells you how likely you are to find a real effect in your test if there is one.

Range: 0 to 1 for power; any positive number for simulations.

Interpretation: Higher power means you're more likely to detect a true effect. More simulations make your power estimate more reliable.
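The sketch below estimates power by simulation: it repeatedly draws control and variant samples with an assumed effect size and counts how often the test reaches p < 0.05 (made-up parameters, not Kubit's internal method):

```python
# Illustrative only: estimate statistical power by repeated simulation.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, base_rate, variant_rate, simulations = 5000, 0.20, 0.22, 1000

hits = 0
for _ in range(simulations):
    control = rng.binomial(1, base_rate, size=n)
    variant = rng.binomial(1, variant_rate, size=n)
    _, p = stats.ttest_ind(variant, control, equal_var=False)
    if p < 0.05:
        hits += 1

print(f"estimated power ~ {hits / simulations:.2f}")
```

More simulations give a steadier estimate of power, at the cost of longer runtime.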
