Databricks

Integrate via Direct Connect


Steps

  1. Create a Serverless SQL Warehouse (size depends on your data volume)

  2. Create a Service Principal with the following permissions:

    1. Read-only permissions to your intended data tables/views

    2. Create table permission (to create a couple of auxiliary tables with full access)

    3. Permission to access the warehouse (to execute the analyses)

    4. Access to the Query History of the user (for troubleshooting purposes)

Create a separate set of the above for Development purposes, to isolate the impact on Production and avoid accidents. We will be ready to provide detailed step-by-step instructions once your team is ready.
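To make the permissions in step 2 concrete, here is a minimal sketch of the Unity Catalog grants, executed through the Databricks SQL connector for Python. All catalog, schema, and connection names are placeholder assumptions, not your actual setup; the service principal is referenced by its application ID, per Unity Catalog convention.

```python
# Sketch: granting the service principal read-only access to the data schema
# plus CREATE TABLE on a dedicated auxiliary schema. Requires the
# databricks-sql-connector package; all names below are placeholders.
from databricks import sql

SP = "`<application-id>`"  # placeholder: the service principal's application ID

GRANTS = [
    # Read-only access to the intended data tables/views.
    f"GRANT USE CATALOG ON CATALOG analytics TO {SP}",
    f"GRANT USE SCHEMA ON SCHEMA analytics.events TO {SP}",
    f"GRANT SELECT ON SCHEMA analytics.events TO {SP}",
    # Permission to create the auxiliary tables, scoped to one schema.
    f"GRANT USE SCHEMA ON SCHEMA analytics.kubit_aux TO {SP}",
    f"GRANT CREATE TABLE ON SCHEMA analytics.kubit_aux TO {SP}",
]

with sql.connect(
    server_hostname="dbc-xxxxxxxx.cloud.databricks.com",  # placeholder
    http_path="/sql/1.0/warehouses/<warehouse-id>",       # placeholder
    access_token="<admin-token>",                         # placeholder
) as conn, conn.cursor() as cur:
    for stmt in GRANTS:
        cur.execute(stmt)
```

Note that warehouse access (CAN USE) and Query History visibility are assigned from the workspace UI or the Permissions API rather than through SQL grants.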

Best Practices

Partitioning

Like other warehouses, Databricks supports table partitioning, which can speed up queries and optimize costs at the same time. To get the most performance out of your Kubit environment, the main consideration when selecting partition columns is to have an (event) date column as the partition key on your fact table(s). Alternatively, a common pattern on Databricks is to use three partition keys - year, month, and day; we can still leverage those, and there is no need to create a separate date column in that case.
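As an illustration of the single date-column pattern, here is a minimal sketch with hypothetical table and column names throughout. It uses a Delta generated column so the partition key stays derived from the event timestamp; connection details are placeholders.

```python
# Sketch: a date-partitioned fact table (all names are hypothetical).
from databricks import sql

DDL = """
CREATE TABLE IF NOT EXISTS analytics.events.fact_events (
    user_id    STRING,
    event_name STRING,
    event_ts   TIMESTAMP,
    -- Generated column keeps the partition key in sync with the timestamp.
    event_date DATE GENERATED ALWAYS AS (CAST(event_ts AS DATE))
)
PARTITIONED BY (event_date)
"""

with sql.connect(
    server_hostname="dbc-xxxxxxxx.cloud.databricks.com",  # placeholder
    http_path="/sql/1.0/warehouses/<warehouse-id>",       # placeholder
    access_token="<token>",                               # placeholder
) as conn, conn.cursor() as cur:
    cur.execute(DDL)
```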

Warehouse Configuration

You have 4 major parameters to think about when configuring your warehouse:

  • Cluster Size - start small and scale up only based on evidence from Query History. If the warehouse feels slow, examine your queries with Query Profile: watch the "Bytes spilled to disk" metric, and if the warehouse is spilling often, increase the size.

  • Auto Stop - this setting controls how long the warehouse remains up and idle after all running queries have finished executing. The trade-off is between cost savings and a few additional seconds on some Kubit reports while the warehouse wakes up. Our recommendation is to set it to the minimum possible: 5 minutes from the UI, or 1 minute via the SQL Warehouses API (see the sketch after this list).

  • Scaling - always start with a minimum and maximum of 1 cluster. You only need to scale your warehouse once you reach the point of running 10 concurrent queries multiple times a day.

  • Type - as a rule of thumb, go with Serverless, but the honest answer is that it depends on how much load you will put on the warehouse. Be aware that Serverless becomes more expensive than Pro if it runs 24/7.
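To tie the four settings together, here is a minimal sketch that creates a warehouse with these recommended starting values, using the Databricks SDK for Python (databricks-sdk). The warehouse name is a placeholder, and authentication is assumed to come from your environment or a configuration profile.

```python
# Sketch: a small Serverless warehouse with conservative starting settings.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.sql import CreateWarehouseRequestWarehouseType

w = WorkspaceClient()  # reads host/credentials from env or ~/.databrickscfg

warehouse = w.warehouses.create(
    name="kubit-prod",               # placeholder name
    cluster_size="Small",            # start small; grow only on evidence
    auto_stop_mins=1,                # 1-minute minimum is API-only; UI floor is 5
    min_num_clusters=1,              # no scaling until concurrency demands it
    max_num_clusters=1,
    enable_serverless_compute=True,  # Serverless as the default choice
    warehouse_type=CreateWarehouseRequestWarehouseType.PRO,
).result()
print(warehouse.id)
```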


Usage Monitoring and Optimization

Keep your costs in check by leveraging the monitoring tools that Databricks provides.
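As one example, assuming the billing system tables are enabled in your workspace, a query along these lines summarizes recent SQL-warehouse DBU consumption. The 30-day window and the connection details are illustrative.

```python
# Sketch: daily DBU usage for SQL warehouses from the system billing tables.
from databricks import sql

QUERY = """
SELECT usage_date,
       SUM(usage_quantity) AS dbus
FROM   system.billing.usage
WHERE  billing_origin_product = 'SQL'
  AND  usage_date >= current_date() - INTERVAL 30 DAYS
GROUP BY usage_date
ORDER BY usage_date
"""

with sql.connect(
    server_hostname="dbc-xxxxxxxx.cloud.databricks.com",  # placeholder
    http_path="/sql/1.0/warehouses/<warehouse-id>",       # placeholder
    access_token="<token>",                               # placeholder
) as conn, conn.cursor() as cur:
    cur.execute(QUERY)
    for usage_date, dbus in cur.fetchall():
        print(usage_date, dbus)
```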

⚠️ Be careful

A stopped Serverless Warehouse can start even if there is no query for it to execute in the following cases:

  • A connection is established to a stopped warehouse from a JDBC/ODBC interface.

  • A dashboard associated with a dashboard-level warehouse is opened.

The Databricks documentation page Start a SQL Warehouse contains a full list of conditions that will wake a stopped warehouse.
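If you want to catch such wake-ups, one option is a periodic check along these lines, a sketch using the Databricks SDK for Python; how you alert on the result is up to you.

```python
# Sketch: flag warehouses that are running with no active sessions --
# a possible sign of an unexpected wake-up.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.sql import State

w = WorkspaceClient()
for wh in w.warehouses.list():
    if wh.state == State.RUNNING and wh.num_active_sessions == 0:
        print(f"{wh.name} is running with no active sessions")
```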
