The Data Modeller intends to support self-service data model changes in your Kubit environment. Want to add handling for a new column in your fact table? Create a derived property, add a new dimension table, or even a whole new schema? This article will provide an introduction to the Data Modeler and will help you understand the main concepts behind it and what you can do in it.
Prerequisites
Access to the Data Modeler feature is exclusively available to Direct Connect customers. Customers utilizing Data Share and Managed Service have read-only access and are unable to modify or create data models within the platform.
You have to be a Kubit Admin to access Data Modeler.
Getting Started
You can find the Data Modeler icon in the "Data Management" icon at the top of your screen:
Once you click on it you will see:
a
New Schema
button you can use to create a new schemaa list of existing Kubit Schemas on the left
the default Schema configuration which you can observe or edit by navigating through the tabs:
Configuration
Connection
Tables
Fields
View Model
Configuration
In this view you can see:
General schema settings:
the
Schema Name
(editable)what
Timezone
is being used for this Schemawhen doing reports with a
Time Unit: Weekly
reports - what day of the week is theWeek Start
set to
This is where you will see all of these parameters in action when you are building reports:
Subjects
a read-only list of all Subjects configured for this schema
Sampling
Sampling can be used to speed up reports and to save costs, especially when dealing with very large datasets. To understand more about Sampling in Kubit you can read this detailed article. What you can control from here is the default sampling rates for:
Default: applies to all Query/Funnel/Path/Retention reports
Cohort: applies to all Cohort Filters and Segments
Minimum Refresh Interval
The Minimum Refresh Interval (MRI) dictates how often Events and Fields can be refreshed. Users with a Creator, Governor, or Admin role can refresh these Events or Fields. For example, if the MRI is set to one day, only one user can refresh it per day. Other users who try to refresh it within that time frame will not be able to until the MRI is up.
All Events can be refreshed from the Event drop down.
Field values can be refreshed by selecting a specific measure filter or report filter, and then refreshing the values at the bottom.
Since the values are only updated when a user initiates a refresh, this reduces computational burden on your warehouse.
Additional Funnel and Path Partition Fields
The default Partition
options in Kubit are (more about them - here):
None
Day
Conversion Window
In a nutshell, the Partition
options control the time window within a conversion that happened within a Funnel or a Path. But in some cases, you might be interested in using a non-temporal field for the partitioning such as Session
. This then allows you to configure the Funnel report to measure only the conversions that happened within a single session.
Status
The Enabled
toggle controls whether a Schema is visible for the Kubit users who build Reports or not. Once disabled, users can no longer select this schema from any Schema dropdown.
Set the status to Disabled
when you're working on a new Schema and don't want your teammates to start using it just yet.
Connection
In order to consume data from your data warehouse you need to set up the connection to it. Kubit relies on a JDBC connection to execute Reports. The first step is to choose the warehouse type from a list of supported warehouses:
BigQuery
ClickHouse
Databricks
MySQL
Presto
Redshift
Snowflake
Then, you have to fill out the rest of the parameters which are as follows (with the exception of BigQuery):
Username: which should be used to login to your warehouse (we recommend creating a dedicated user for Kubit)
Password: for the username (we recommend a length of 60 characters, a randomly generated mix of uppercase, lowercase, numbers, and special characters)
Timeout: on the JDBC connection, make sure to check what the timeout configured on your warehouse and set it to the same value or less
JDBC URL: to establish the JDBC connection
End Date Type: what should the date picker default to as the end date - today, yesterday, a static date, or something else? The recommendation is to keep the default (
CURRENT
).
Tables
There are 3 things you can do in this view:
add more facts or tables
edit table (mapping)
create Derived Columns
Add Table
To add a table you start by clicking on the Add Table
button:
Then you see what we call the Schema Browser
where you can navigate between schemas in the database configured in the Connection
tab. In each schema you will see a full list of:
tables
views
materialized views
Once you find the table you want to add you select it, choose whether it's a fact or a dimension table and then hit Confirm
:
When you add a new table you need to set some required fields before you can save the changes to your Schema. You will see a table view of all the columns and their types which you can map to Kubit Fields:
and below that you will find the required fields:
Event: this is the event name
Timestamp: the event timestamp is used for finding the sequence of events in the Funnel or Path
Date: the event date is used for all the date ranges in Kubit
Subjects: you need to add at least one Subject per fact table
In case you don't have an event date column in your table you can create a Derived Column and use it!
Edit Table (Mapping)
Which columns from a table should be used in Kubit? Here's a short demo of how you select the required fields and map a couple of columns:
A few things to call out:
once you set the Required Fields for a Table you cannot change them
you can apply type overrides only when you are adding a new table (e.g. a numeric value is stored as a string but you want to do sums with it in Kubit)
In this view, you can also set the Field name a column should map to and its description.
Create a Derived Column
Derived Columns come to the rescue when you don't have all of the required columns already in place or you want to slightly modify the values in a column. A Derived Column, in a nutshell, allows you to create a new logical column based on other columns that already exist in the table. To start, click on Add Derived Column
:
Then you have 2 ways to define the column:
Raw SQL: most common use cases are when you need to cast a type, concatenate strings, or pick a property from an object (e.g. JSON or Struct)
Mapped String: useful when you need to map an enum, sometimes the database would utilize a numeric type to store some data and you need the human-readable names in your analytics tool - this is when you will use Mapped String
The SQL option supports only single-row operations. If you'd like to use Window Functions, please check out Derived Fields.
Here are a couple of examples, one is using SQL to create a date column from a timestamp column:
And the other is mapping integers to human-readable names:
Update Derived Columns
Derived Column logic can be updated by clicking on the column and editing its definition. This makes it simple to fix definition mistakes and make definition updates to match changing organization requirements.
Fields
This is where you can:
Enable Fields for
Filter
andBreakdown
Disable a Field (make it unavailable for selection in any new Reports)
Change the
Field Name
andDescription
Create
Derived Fields
Enable Fields for Filter and Breakdown
This is as simple as it gets if you want to tweak whether a Field is eligible for Filter
or Breakdown
or not all you have to do is (un)check a checkbox and hit Confirm
.
Disable a Field
In some cases, you might want to Disable a Field. Maybe you're no longer populating a column with data or you don't trust it - whatever the reason, it's again just a click on top of a checkbox and then Confirm
:
Change Field Name and Description
Maybe a Field Name is a bit confusing or ambiguous - nothing to worry about, you can always change it or add a description to make it easier to understand. Here's how:
Derived Fields
There are 2 main scenarios in which you should use Derived Field
instead of a Derived Column
:
you want to apply SQL Window Functions, e.g. to find out the duration of a session by calculating the time difference between the first and last event in a session
you have multiple fact tables which contain the same
Field
so instead of definingDerived Column
in each table, you can define it as aDerived Field
only once
To go back to our session duration example, let's check the Window Functions
checkbox and input this sample SQL:
timestampdiff(second, min(@event_timestamp) over (partition by @session_id), max(@event_timestamp) over (partition by @session_id))
Remember to use @ when referencing a column!
When Window Functions
aren't being used the behaviour is equivalent to Derived Column
, it just may be applied over more than 1 column in case you have multiple fact tables in a union.
View Model
This is where you can see all the tables and how they are related to each other and which columns are mapped to which Fields.