Using CSV files from Azure Data Lake Storage Gen2 in SQL Server

Let’s say we have on-premises SQL Server and Azure blob storage account. An external company or source uploads data in the blob storage daily basis. The requirement is to query these csv file data after receiving the csv files in the Azure blob storage. We need to process those files from an on-premises SQL Server using PolyBase.

This tutorial is to show you how to configure PolyBase and query Azure Blob Storage Gen2 data using native T-SQL technique. Following are the required steps to configure a remote Azure Blob storage:

We need an Azure Subscription.
Hierarchical namespace enabled storage gen2 account, blob container and folder.
Blob storage Access Key.
Install and configure PolyBase Engine feature.
Create a database master key
Create database scoped credential.
Create an external file format.
Create an external table.
Query the data with T-SQL.

Our Scenario: In our tutorial, we have the following Azure blob storage configuration:

Azure Blob Account: home80
Blob Storage Container: import
Folder inside the container: csvfile
Inside the folder: Two CSV files “Address.CSV” and “CountryRegion.CSV”.

Step by step:

Step 1: Install and configure PolyBase Engine along with the Hadoop connector.

Step 2: Enable PolyBase Engine and Hadoop connectivity and then restart the SQL Server Service.

EXEC sp_configure

@configname = 'polybase enabled',

@configvalue = 1;

RECONFIGURE;

EXEC sp_configure

@configname = 'hadoop connectivity',

@configvalue = 7;

RECONFIGURE;

Step 3: Create a schema, optional but for better organization:

USE AzureDB

CREATE SCHEMA ext;

Step 4: Create a master key on the database, if one does not already exist. This is required to encrypt the credential secret.

CREATE MASTER KEY ENCRYPTION BY PASSWORD = 'azure@123';

Step 5: Create a database scoped credential:

CREATE DATABASE SCOPED CREDENTIAL AzureBlobCredential

WITH IDENTITY = 'home80',

SECRET = 'JcGvud0UIFT4qAlrZSyregc3CoqLkxpB/a3jNYByaukvT0BqT4/TpHTsbjqlOwBEHjevnjbSwPsCBK5WmRJDFA==';

See the following screenshot to obtain the Access Key. Note that the Identity can be anything and it will not be used to authenticate while accessing blob storage container. Here, it is just to satisfy the syntax requirement.

Step 6: Create a external data source “AzureBlobSource” as follows:

CREATE EXTERNAL DATA SOURCE AzureBlobSource

WITH (

TYPE = HADOOP,

LOCATION = 'wasbs://import@home80.blob.core.windows.net',

CREDENTIAL = AzureBlobCredential,

PUSHDOWN = ON

);

Step 7: Create an external file format “TextFileFormat” as follows:

CREATE EXTERNAL FILE FORMAT TextFileFormat

WITH (

FORMAT_TYPE = DELIMITEDTEXT,

FORMAT_OPTIONS

(

FIELD_TERMINATOR =',',

USE_TYPE_DEFAULT = TRUE,

STRING_DELIMITER = '"'

)

);

Step 8: We will be importing two CSV files; thus we need to create two external tables.

CREATE EXTERNAL TABLE [ext].[AddressTbl]

( [AddressLine1] VARCHAR(100),

[City] VARCHAR(50),

[PostalCode] VARCHAR(25),

[ModifiedDate] VARCHAR(32))

WITH (

LOCATION = '/csvfile/Address.csv',

DATA_SOURCE = AzureBlobSource,

FILE_FORMAT = TextFileFormat,

REJECT_TYPE = VALUE,

REJECT_VALUE = 5);

CREATE EXTERNAL TABLE [ext].[CountryRegionTbl]

( [StateProvinceCode] VARCHAR(20),

[StateProvinceName] VARCHAR(50),

[CountryRegionCode] VARCHAR(25),

[CountryRegionName] VARCHAR(25))

WITH (

LOCATION = '/csvfile/CountryRegion.csv',

DATA_SOURCE = AzureBlobSource,

FILE_FORMAT = TextFileFormat,

REJECT_TYPE = VALUE,

REJECT_VALUE = 5);

Step 9: Optionally, we can create statistics on the external tables that we have just created in the above step.

CREATE STATISTICS [statCity ] ON [ext].[AddressTbl]([City])

CREATE STATISTICS [statPostalCodeCity] ON [ext].[AddressTbl]([PostalCode], [City])

Step 10: We are done and now it is the time to execute some SELECT statement.

SELECT AddressLine1,

City,

PostalCode,

CAST(ModifiedDate AS DATETIME) AS ModifiedDate

FROM [ext].[AddressTbl]

WHERE City = 'Los Angeles';

SELECT AddressLine1,

City,

PostalCode,

ModifiedDate

FROM [ext].[AddressTbl]

WHERE PostalCode LIKE '91%'

AND City LIKE 'L%';

Step 11: Here is the out from the above queries:

Conclusion: While PolyBase is a good data integration mechanism, but it has some limitation when developing solution from on-premises SQL Server. The CSV file must be formatted properly to reduce the chance of row rejection. Note that the PolyBase on an on-premises SQL Server does not support FIRST_ROW and REJECTED_ROW_LOCATION, therefore an extra effort is required for the text file row header and bad data in the CSV file, otherwise query will produce unexpected result or will fail when the REJECT_VALUE is reached.

I hope that this will help you to get started with Querying Azure blob storage data with the PolyBase Engine from your on-premises SQL Server.

Further Reading:

PolyBase features and limitations

https://docs.microsoft.com/en-us/sql/relational-databases/polybase/polybase-versioned-feature-summary?view=sql-server-ver15

Install PolyBase on Windows

https://docs.microsoft.com/en-us/sql/relational-databases/polybase/polybase-installation?view=sql-server-ver15

Configure PolyBase to access external data in Azure Blob Storage

https://docs.microsoft.com/en-us/sql/relational-databases/polybase/polybase-configure-azure-blob-storage?view=sql-server-ver15

SQL Server – performance and other stories

Total Pageviews

Tuesday, August 11, 2020

Using CSV files from Azure Data Lake Storage Gen2 in SQL Server

No comments:

Post a Comment