Querying ADLS using the ADX Connector and External Tables
Overview
An effective mechanism for querying data in an Azure Data Lake Storage Gen2 (ADLS) account is to define external tables in Azure Data Explorer (ADX) and then use the ADX connector to read the data through them. Note that performance is highly dependent on the source data format and the partitioning scheme within ADLS. You can find more information about external tables here.
Configuration
External Table Creation
The connector itself is configured as described in the configuration section here: Azure Data Explorer (ADX). This section outlines how to configure the external tables and stored functions needed to access that data from within Seeq, and it is driven by an example. The example assumes a hierarchy of parquet files stored within ADLS in the path format below:
Region_Name/Sensor_Name/yyyy/MM/dd/file.parquet
Using this format, data from a temperature sensor in Region A would be broken into one file per day and stored under the associated directory path. For example, data collected on June 22, 2020, would be in the following path:
Region_A/Temperature/2020/06/22/data.parquet
The first step in accessing this data from Seeq is to define an external table. This is done in Kusto by defining the external table as documented here. For this example, the KQL would look like the following:
.create-or-alter external table SensorDataPart (Time:datetime, Value:real)
kind=blob
partition by (RegionName:string, SensorName:string, Date:datetime=startofday(Time))
pathformat=(RegionName "/" SensorName "/" datetime_pattern("yyyy/MM/dd", Date))
dataformat=parquet
(
h@'https://<blob-store-url>;<secret-key>'
)
This external table definition also assumes that each parquet file's schema is a series of time/value pairs matching the Time and Value columns declared above.
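If you are unsure whether the files match this schema, a quick way to check (assuming the external table above has been created) is to ask ADX for the schema it resolves:
external_table('SensorDataPart')
| getschema
This should list the Time and Value columns along with the partition columns defined above.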
Once this external table is created, you can verify that it is identifying the correct artifacts within the ADLS storage account by executing the following command:
.show external table SensorDataPart artifacts
This should return a list of the files that ADX identified and the partitions they are located in.
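For the example hierarchy above, the listed artifacts would include URIs similar to the following (illustrative only; the exact output columns depend on the ADX version):
https://<blob-store-url>/Region_A/Temperature/2020/06/22/data.parquet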
The data can be verified by executing the following query:
external_table('SensorDataPart')
| take 10
This should return the top 10 rows from that external table.
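Because the table is partitioned, filtering on the partition columns allows ADX to prune files rather than scan the entire store, which is where the performance note in the overview comes from. For example, using the region and sensor from the example above:
external_table('SensorDataPart')
| where RegionName == 'Region_A' and SensorName == 'Temperature'
| where Time between (datetime(2020-06-22) .. datetime(2020-06-23))
| take 10
Only the matching partition directories should be read when serving this query.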
Stored Function Creation
Because the ADX connector needs to be able to escape spaces within a table name, the easiest mechanism for accessing this external table from Seeq is a stored function rather than the external_table() function call. To create a stored function that passes queries through to the external table, execute the following Kusto command:
.create-or-alter function with (docstring = "QueryPartitionedDataLake", folder = "Seeq")
QueryPartitionedDataLake() {external_table('SensorDataPart')}
This will create a stored function called QueryPartitionedDataLake that can now be referenced as a table in the ADX connector configuration file.
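You can verify that the stored function behaves like a table by invoking it directly:
QueryPartitionedDataLake()
| take 10
This should return the same rows as the external_table('SensorDataPart') query above.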
Connector Configuration
To configure the ADX connector, follow the steps located here. The table configuration is the main area of interest for this example:
{
"TableDefinitionName" : "ExternalTablePartitioned",
"TableName" : "QueryPartitionedDataLake",
"TransformQuery" : null,
"TimeColumn" : "Time",
"TimeUnit" : null,
"DataColumns" : [ "Value" ],
"DataColumnQuery" : null,
"GroupBy" : [ "RegionName", "SensorName" ],
"LastGroupAsSignalName" : true
}
The table configuration above uses the stored function name, QueryPartitionedDataLake, as the value of the TableName field. The TimeColumn value is Time, matching the schema declared in the external table definition. The GroupBy list is built from the partition columns; with LastGroupAsSignalName set to true, the last group (SensorName) supplies the signal (tag) name while the preceding group (RegionName) becomes the asset name.
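For the example data above, this yields an asset tree roughly like the sketch below (illustrative; the exact presentation in Seeq may differ):
Region_A
└── Temperature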