Databricks Lakehouse

JVM AGENT

Item Type Support

SIGNALS SCALARS CONDITION

Overview

The SQL connector enables Seeq to access data from Databricks Lakehouse.

Prerequisites

You will need some information to configure a connection to your Databricks Lakehouse.

HttpPath

The HTTP path can be found in the JDBC configuration of your cluster

Authentication

Set the Password to the value for the PAT token, the user name is not used.

Hostname

The URL for the Databricks cluster, which can be found in the Databricks UI. This field has to contain the complete URL for the Databricks cluster including the port number like in the following example: adb-1234567890.11.azuredatabricks.net:443/default.

Configuration

This is an example configuration template that is displayed in the Additional Configuration box that appears when you click Configure for an existing datasource (or if a new datasource is being created, in the Create new datasource connection modal that appears after clicking Add Datasource and choosing the Databricks template) on the Datasources administration page.

JSON

{
    "QueryDefinitions": null,
    "Type": "DATABRICKS",
    "Hostname": "<insert hostname>",
    "Port": 0,
    "DatabaseName": null,
    "Username": null,
    "Password": null,
    "InitialSql": null,
    "TimeZone": null,
    "PrintRows": false,
    "UseWindowsAuth": false,
    "SqlFetchBatchSize": 100000,
    "HttpPath": "<insert http path>",
    "Token": "<insert token>",
    "ExpandQueryRange": true,
    "JdbcConnectionStringOverride": null
}

Standard SQL Additional Configuration

Property Name	Default Value	Data Type	Description
`QueryDefinitions`	null	Array[QueryDefinition]	The definition for how Seeq should query for data. If your hostname is of the form `"abc\def"`, you will have to escape the backslash like so: `"abc\\def"`.
`Hostname`	null	String	The hostname of your datasource.
`QueryDefinitionExpansionLimit`	1,000,000	Integer	The maximum number of signals that can be indexed from a single query definition. This value is here to protect against incorrect query definitions producing many millions of invalid signals.
`Port`	0	Integer	The port for the JDBC Connection.
`Database Name`	null	String	Optional: Can be defined here or as part of a fully qualified table name in the QueryDefinition.
`Username`	“"	String	The user name
`Password`	null	String/SecretFile	The user password.
`JdbcConnectionStringOverride`	null	String	Optional: Can be specified if you have a known, functioning JDBC connection string. If specified, `Hostname`, `Port` and `Database Name` need not be specified.
`InitialSql`	null	String	Optional: A SQL command that would be run one upon establishing a connection.
`TimeZone`	null	String	Optional: The time zone to use for timestamp or datetime columns. For example, to set this to US Pacific Time, you would use `America/Los_Angeles`.
`PrintRows`	false	Boolean	The rows from the SQL query will be printed to the jvm-link log. This is for debugging purposes only, and should be set to `false` in normal operation.
`UseWindowsAuth`	false	Boolean	Note: This is not available for all Database Types. If you are using a database type that supports Windows Authenication, you will need to ensure that the remote agent is running as the correct user.
`JdbcProperties`	null	Map<String, String>	A map of key value pairs that are configured as connection properties
`MinimumIdleConnections`	null	Integer - {0,1}	Specifies the minimum number of idle connections to maintain in the connection pool. If not explicitly specified, it defaults to a database-specific value, typically 1. For Databricks, it defaults to 0 to prevent the cluster from being kept running but it can be changed to 1 if needed.

Time Zone

Some SQL date and/or time column types have no zone information. The TimeZone field is available to specify the time zone that Seeq should use for data coming from columns types that have no time zone information of their own. UTC offsets (+01:00, -10:30, etc.) and IANA regions (America/Los_Angeles) are accepted. If no time zone is specified, Seeq defaults to the local region of the Seeq server. If your data was stored in UTC time, set this field to "UTC" or "+00:00". If your data was entered using a "wall clock", set this to the IANA time region of the "wall clock". Note that offsets are constant throughout the year whereas a region may observe daylight savings time. If you used a wall clock in a location that observes daylight savings time, a region is a better choice than an offset for this field. A list of IANA regions (tz database time zones) can be found here.

Databricks-Specific SQL Additional Configuration

Property Name	Default Value	Data Type	Description
`HttpPath`	““	String	The HTTP path for the Databricks cluster, which can be found in the Databricks UI.
`ShouldMonitor`	false	Boolean	Will turn off connection monitoring which causes a request once every 5 seconds. This request does not use warehouse time, but does count against your cloud services credits.
`TokenUrl`	Empty	String	The token endpoint URL provided by your identity provider. This field is only applicable when using OAuth 2.0 token pass-through authentication.
`ClientId`	Empty	String/SecretFile	The Client ID of the service principal created in your identity provider. This field is only applicable when using OAuth 2.0 token pass-through or machine-to-machine (M2M) authentication.
`ClientSecret`	Empty	String/SecretFile	The Client Secret of the service principal created in your identity provider. This field is only applicable when using OAuth 2.0 token pass-through or machine-to-machine (M2M) authentication.
`Scope`	Empty	String	The scope configured in both the identity provider and Databricks. This field is only applicable when using OAuth 2.0 token pass-through authentication

Authentication Field Requirements by Type

Authentication type Field	Databricks personal access token	OAuth 2.0 token pass-through authentication	OAuth machine-to-machine (M2M) authentication
`Username`
`Password`
`TokenUrl`
`ClientId`
`ClientSecret`
`Scope`

Known Issues

As of June 2024, the Databricks JDBC driver does not yet support Arrow serialization on Java 21+. If you are using Arrow serialization in your Databricks Lakehouse, you will need to append ;EnableArrow=0 to the httpPath parameter in the connection configuration.

This can also be achieved using:

CODE

      "JdbcProperties": {
        "EnableArrow": "0"
      },

Please report any issues you find to our support portal.

Troubleshooting

A couple of errors that you may encounter are:

String-valued samples are prohibited in numeric-valued signal

If the y-axis value of the signal is a string, then the Value Unit Of Measure property is required and must be set to "string". See Example 2.

Since the Value Unit Of Measure is different for string and numeric signals, it may be easiest to write one query definition for the numeric signals and write another for the string signals. Alternatively, the Value Unit Of Measure property could be set according to an SQL IF statement similar to the technique used in Example 13.

Samples must be ordered by their keys

If this is occurring when trending near the daylight savings transition, this is an indication that the TimeZone is not configured properly. For example, if TimeZone is set to "America/Los_Angeles", this means that the timestamp data in the SQL table was recorded using "America/Los_Angeles" time (Pacific) which observes daylight savings. During the spring daylight savings transition, time skips from 01:59:59.9 to 03:00:00.0 which means that the 02:00 hour doesn't exist and therefore there should be no data in the SQL table during that 02:00 hour. Any data in the 02:00 hour is interpreted as being in the 03:00 hour. If there is also data in the 03:00 hour, the samples will be out of order.

Original data:	01:15, 01:45, 02:15, 02:45, 03:15
After accounting for non-existent 02:00 hour:	01:15, 01:45, 03:15, 03:45, 03:15

If data exists in the 02:00 hour, it must mean it was either recorded in error or was recorded in a time zone that doesn't observe daylight savings such as UTC or a constant offset from UTC.

For more information, see the TimeZone field in the Configuration section above.

If you are running into issues with connecting to or access data from Snowflake, view our guide for troubleshooting datasource issues.

Performance considerations

Appropriate partitioning and Z-ordering should be leveraged to have reasonable performance

View our guide on optimizing datasource performance for general guidance.