The Data Modeller intends to support self-service data model changes in your Kubit environment. Want to add handling for a new column in your fact table? Create a derived property, add a new dimension table, or even a whole new schema? This article will provide an introduction to the Data Modeler and will help you understand the main concepts behind it and what you can do in it.

Prerequisites

Access to the Data Modeler feature is exclusively available to Direct Connect customers. Customers utilizing Data Share and Managed Service have read-only access and are unable to modify or create data models within the platform.
You have to be a Kubit Admin to access Data Modeler.

Getting Started

You can find the Data Modeler icon in the "Data Management" icon at the top of your screen:

Once you click on it you will see:

a New Schema button you can use to create a new schema
a list of existing Kubit Schemas on the left
the default Schema configuration which you can observe or edit by navigating through the tabs:
- Configuration
- Connection
- Tables
- Fields
- View Model

Configuration

In this view you can see:

General schema settings:

the Schema Name (editable)
what Timezone is being used for this Schema
when doing reports with a Time Unit: Weekly reports - what day of the week is the Week Start set to

This is where you will see all of these parameters in action when you are building reports:

Subjects

a read-only list of all Subjects configured for this schema

Sampling

Sampling can be used to speed up reports and to save costs, especially when dealing with very large datasets. To understand more about Sampling in Kubit you can read this detailed article. What you can control from here is the default sampling rates for:

Default: applies to all Query/Funnel/Path/Retention reports
Cohort: applies to all Cohort Filters and Segments

Minimum Refresh Interval

The Minimum Refresh Interval (MRI) dictates how often Events and Fields can be refreshed. Users with a Creator, Governor, or Admin role can refresh these Events or Fields. For example, if the MRI is set to one day, only one user can refresh it per day. Other users who try to refresh it within that time frame will not be able to until the MRI is up.

All Events can be refreshed from the Event drop down.

Field values can be refreshed by selecting a specific measure filter or report filter, and then refreshing the values at the bottom.

Since the values are only updated when a user initiates a refresh, this reduces computational burden on your warehouse.

Additional Funnel and Path Partition Fields

The default Partition options in Kubit are (more about them - here):

None
Day
Conversion Window

In a nutshell, the Partition options control the time window within a conversion that happened within a Funnel or a Path. But in some cases, you might be interested in using a non-temporal field for the partitioning such as Session. This then allows you to configure the Funnel report to measure only the conversions that happened within a single session.

Status

The Enabled toggle controls whether a Schema is visible for the Kubit users who build Reports or not. Once disabled, users can no longer select this schema from any Schema dropdown.

Set the status to Disabled when you're working on a new Schema and don't want your teammates to start using it just yet.

Connection

In order to consume data from your data warehouse you need to set up the connection to it. Kubit relies on a JDBC connection to execute Reports. The first step is to choose the warehouse type from a list of supported warehouses:

BigQuery
ClickHouse
Databricks
MySQL
Presto
Redshift
Snowflake

Then, you have to fill out the rest of the parameters which are as follows (with the exception of BigQuery):

Username: which should be used to login to your warehouse (we recommend creating a dedicated user for Kubit)
Password: for the username (we recommend a length of 60 characters, a randomly generated mix of uppercase, lowercase, numbers, and special characters)
Timeout: on the JDBC connection, make sure to check what the timeout configured on your warehouse and set it to the same value or less
JDBC URL: to establish the JDBC connection
End Date Type: what should the date picker default to as the end date - today, yesterday, a static date, or something else? The recommendation is to keep the default (CURRENT).

Tables

There are 3 things you can do in this view:

add more facts or tables
edit table (mapping)
create Derived Columns

Add Table

To add a table you start by clicking on the Add Table button:

Then you see what we call the Schema Browser where you can navigate between schemas in the database configured in the Connection tab. In each schema you will see a full list of:

tables
views
materialized views

Once you find the table you want to add you select it, choose whether it's a fact or a dimension table and then hit Confirm:

When you add a new table you need to set some required fields before you can save the changes to your Schema. You will see a table view of all the columns and their types which you can map to Kubit Fields:

and below that you will find the required fields:

Event: this is the event name
Timestamp: the event timestamp is used for finding the sequence of events in the Funnel or Path
Date: the event date is used for all the date ranges in Kubit
Subjects: you need to add at least one Subject per fact table

In case you don't have an event date column in your table you can create a Derived Column and use it!

Add a Custom Join Condition

In Dimension tables, there is an extra field called Custom SQL Condition. This allows you to add extra join conditions to the Dimension.

Edit Table (Mapping)

Which columns from a table should be used in Kubit? Here's a short demo of how you select the required fields and map a couple of columns:

A few things to call out:

once you set the Required Fields for a Table you cannot change them
you can apply type overrides only when you are adding a new table (e.g. a numeric value is stored as a string but you want to do sums with it in Kubit)

In this view, you can also set the Field name a column should map to and its description.

Create a Derived Column

Derived Columns come to the rescue when you don't have all of the required columns already in place or you want to slightly modify the values in a column. A Derived Column, in a nutshell, allows you to create a new logical column based on other columns that already exist in the table. To start, click on Add Derived Column:

Then you have 2 ways to define the column:

Raw SQL: most common use cases are when you need to cast a type, concatenate strings, or pick a property from an object (e.g. JSON or Struct)
Mapped String: useful when you need to map an enum, sometimes the database would utilize a numeric type to store some data and you need the human-readable names in your analytics tool - this is when you will use Mapped String

The SQL option supports only single-row operations. If you'd like to use Window Functions, please check out Derived Fields.

Here are a couple of examples, one is using SQL to create a date column from a timestamp column:

And the other is mapping integers to human-readable names:

Update Derived Columns

Derived Column logic can be updated by clicking on the column and editing its definition. This makes it simple to fix definition mistakes and make definition updates to match changing organization requirements.

Fields

This is where you can:

Enable Fields for Filter and Breakdown
Disable a Field (make it unavailable for selection in any new Reports)
Change the Field Name and Description
Create Derived Fields

Enable Fields for Filter and Breakdown

This is as simple as it gets if you want to tweak whether a Field is eligible for Filter or Breakdown or not all you have to do is (un)check a checkbox and hit Confirm.

Disable a Field

In some cases, you might want to Disable a Field. Maybe you're no longer populating a column with data or you don't trust it - whatever the reason, it's again just a click on top of a checkbox and then Confirm:

Change Field Name and Description

Maybe a Field Name is a bit confusing or ambiguous - nothing to worry about, you can always change it or add a description to make it easier to understand. Here's how:

Derived Fields

There are 2 main scenarios in which you should use Derived Field instead of a Derived Column:

you want to apply SQL Window Functions, e.g. to find out the duration of a session by calculating the time difference between the first and last event in a session
you have multiple fact tables which contain the same Field so instead of defining Derived Column in each table, you can define it as a Derived Field only once

To go back to our session duration example, let's check the Window Functions checkbox and input this sample SQL:

timestampdiff(second, min(@event_timestamp) over (partition by @session_id), max(@event_timestamp) over (partition by @session_id))

Remember to use @ when referencing a column!

When Window Functions aren't being used the behaviour is equivalent to Derived Column, it just may be applied over more than 1 column in case you have multiple fact tables in a union.