PolyBase – as data integration and Processing Technique:
I have experienced, many developers lack a good understanding of query processing techniques. While small scale data processing tasks using Linked Server work, for a heavier workload, Linked Server query suffers significantly from performance issues, extensive resource utilization and often receives “query-timeout”.
This article is not a how-to guide for writing efficient Linked Server based query, but rather using an alternative technique known as PolyBase to overcome some known query processing difficulties. Starting from SQL Server 2016, Microsoft has introduced PolyBase. The functionality of PolyBase supports all kind of data sources and it has now becomes a de-facto data integration choice and technique to process remote datasets on Microsoft Data Platform.
Here, we will examine and compare the same query utilizing both “Linked Server” and “PolyBase”.
Query Testing Scenario:
The following is the local and remote server, where we will be testing each of the techniques:
- POLY01: We have SQL Server 2019, with a single database named TestDB which has only one table titled “StateProvinceCountryRegion”.
- WIN1601: We have SQL Server 2014, with the database “AdventureWorks2014”.
We will write a query from POLY01 which will join a table on the remote server WIN1601. We will use “Linked Server” and PolyBase to execute the same query to process the same data.
Linked Server: Lets create a “Linked Server” named LNKSRV from POLY01 to WIN1601.
On WIN1601 server:
(a) Create a login “ployuser” with password “poly@123”.
(b) Grant data_reader permission to the polyuser login on “AdventureWorks2014”. (Note we can also grant only SELECT permission to the table “Person.Address”).
On POLY01 server:
(c) Execute the following statement to create the Linked Server, LNKSRV.
USE [master];
GO
EXEC master.dbo.sp_addlinkedserver
@server = N'LNKSRV',
@srvproduct = N'SQLSERVER',
@provider = N'SQLNCLI',
@datasrc = N'WIN1601';
GO
EXEC master.dbo.sp_addlinkedsrvlogin
@rmtsrvname = N'LNKSRV',
@useself = N'False',
@locallogin = NULL,
@rmtuser = N'polyuser',
@rmtpassword = 'poly@123';
GO
(d) Execute the following query on POLY01 server to test the Linked Server configuration.
USE TestDB
GO
SELECT
a.AddressID,
a.AddressLine1,
a.City,
a.StateProvinceID ,
a.PostalCode
FROM OPENQUERY( [LNKSRV], 'SELECT AddressID, AddressLine1, City, StateProvinceID, PostalCode FROM [AdventureWorks2014].[person].[Address] ' ) AS a
Note: Direct specification of remote server as a data source will not work when a table contains a GEOGRAPHY or a GEOMETRY column. For example, the following query will fail with the error mentioned below:
SELECT a.AddressID,
a.AddressLine1,
a.City,
a.StateProvinceID,
a.PostalCode
FROM [LNKSRV].[AdventureWorks2014].[person].[Address] AS a;
Msg 7325, Level 16, State 1, Line 2
Objects exposing columns with CLR types are not allowed in distributed queries. Please use a pass-through query to access remote object '"AdventureWorks2014"."person"."Address"'.
PolyBase Connection:
On POLY01 Server, the Polybase Service has already been installed and configured, and is running. So the next task is to create a sample database and an external table.
(a) Let’s create a test database:
USE master;
GO
CREATE DATABASE TestDB;
GO
(b) Create a master key on the TestDB database:
CREATE MASTER KEY ENCRYPTION BY PASSWORD = 'poly@123';
(c) Create a database scoped credential:
CREATE DATABASE SCOPED CREDENTIAL polycredential
WITH IDENTITY = 'polyuser', SECRET = 'poly@123';
(d) Create an external data source:
CREATE EXTERNAL DATA SOURCE ADWSource
WITH (LOCATION = 'sqlserver://WIN1601',
PUSHDOWN = ON,
CREDENTIAL = polycredential);
(e) Create an external table (creating schema is optional):
CREATE SCHEMA ext;
CREATE EXTERNAL TABLE [ext].[LinkTbl]
([AddressID] [INT] NULL,
[AddressLine1] [NVARCHAR](60) NULL,
[AddressLine2] [NVARCHAR](60) NULL,
[City] [NVARCHAR](30) NULL,
[StateProvinceID] [INT] NULL,
[PostalCode] [NVARCHAR](15) NULL)
WITH (LOCATION = '[AdventureWorks2014].[person].[Address]',
DATA_SOURCE = [ADWSource]);
(f) Execute the following query on the POLY01 server to test the PolyBase configuration.
USE TestDB
GO
SELECT a.AddressID,
a.AddressLine1,
a.City,
a.StateProvinceID,
a.PostalCode
FROM ext.LinkTbl a;
Performance comparison Query:
Now that we have configured both the Linked Server and PolyBase, let’s create a table in the TestDB database as “StateProvinceCountryRegion” on POLY01. This table was derived from a view “vStateProvinceCountryRegion” in the “AdventureWorks2014” database. Also, create a clustered index on the StateProvinceID column.
To compare the query performance from Linked Server and PolyBase, execute the following code while enabling the “Include Live Query Statistics” in SSMS.
-- Query using PolyBase
DBCC DROPCLEANBUFFERS;
USE TestDB;
GO
SELECT a.AddressID,
a.AddressLine1,
a.City,
a.StateProvinceID,
a.PostalCode,
b.StateProvinceID,
b.StateProvinceCode,
b.StateProvinceName,
b.CountryRegionCode,
b.CountryRegionName
FROM ext.LinkTbl a
LEFT JOIN dbo.StateProvinceCountryRegion b ON a.StateProvinceID = b.StateProvinceID
WHERE b.CountryRegionName = 'Canada'
AND a.City = 'Calgary';
-- Query using Linked Server
DBCC DROPCLEANBUFFERS;
SELECT a.AddressID,
a.AddressLine1,
a.City,
a.StateProvinceID,
a.PostalCode,
b.StateProvinceID,
b.StateProvinceCode,
b.StateProvinceName,
b.CountryRegionCode,
b.CountryRegionName
FROM OPENQUERY([LNKSRV], 'SELECT AddressID, AddressLine1, City, StateProvinceID, PostalCode FROM [AdventureWorks2014].[person].[Address] ') AS a
LEFT JOIN dbo.StateProvinceCountryRegion b
ON a.StateProvinceID = b.StateProvinceID
WHERE b.CountryRegionName = 'Canada'
AND a.City = 'Calgary';
Linked Server vs PolyBase Query Execution plan and Statistics:
Performance Gain:
From the above execution plan, it is crystal clear that the query cost of PolyBase is 2% whereas, whereas it is 98% for Linked Server.
As we can see from the query statistics, using PolyBase to query remote data is significantly more efficient. So what is the difference between the two techniques? In a nutshell, the Linked Server query fetches all the records from the target server and then filters it, while the PolyBase query applies the filter condition first and then fetches the record.
Conclusion:
If we understand the above execution plan, then I hope that it is now easy to rewrite a Linked Server query to significantly improve performance. If tuning or rewriting a Linked Server query is difficult for whatever reason, then it will be a good idea to start using the PolyBase technique with your next development project.
Further Reading:
What is PolyBase?
Configure PolyBase to access external data in SQL Server
PolyBase Transact-SQL reference
No comments:
Post a Comment