How to Create and Use CETAS in Azure SQL Pools

Introduction

External tables are useful and powerful in many scenarios, especially in a data lakehouse architecture where you need to read data from Parquet or Delta files (for example) and create tables that expose that data to Power BI or Tableau.

The capabilities of CREATE EXTERNAL TABLE AS SELECT (CETAS) are important to mention, and CETAS can be used in a dedicated SQL pool or a serverless SQL pool. CETAS is a big player when we need speed and performance: the tables are fast to create and to read, depending on the type of table distribution we choose when creating them.

However, depending on your data lake and Azure Synapse security setup, you will need to check certain folder and credential permission requirements.

In Part 1 of this blog, we will go through the environment setup, roles, credentials, and the creation of CETAS external tables over one or multiple parquet files in the dedicated pool. Remember that here you are creating a pointer to the files in the Azure data lake, and what you see in the SQL pool as a table is only metadata that you can query. This means that we are not creating anything physically in the database.

In Part 2, we will use CETAS to dynamically create tables in the dedicated pool to expose data in the Gold layer, which has certain advantages when the business scenario calls for this technology instead of a serverless pool.

Finally, another capability of CETAS is to create pre-calculated datasets, similar to views, for simple or complex queries and have them ready to be consumed by Power BI. We will cover this scenario in Part 2.

High-Level Overview

These steps assume you have an existing Azure Synapse workspace, a serverless SQL pool, an Azure Data Lake Gen2 folder structure, and a parquet file to be read.

Please note: We will use the Microsoft Wide World Importers DW database as the data source for these examples.

  • Wide World Importers DW can be found here
  • Check roles to access the ADLS Gen2 storage
  • Create SAS Token for the database credential
  • Check if there is any credential created previously in the Dedicated Pool
  • Create DB Master Key
  • Create DB-scoped credential
  • Create an external data source
  • Create External File Format
  • Verify the source and destination Mapping for CETAS
  • Create an External table on a single file 
  • Create an External table on a set of files

GitHub Code

The code used can be found here as cetas_part1.sql

Environment Setup

Map of Azure Resources in the testing environment.

Storage Account: jgstoragedatalake

Container: deltadev01

Folder: bronze

Source Azure SQL Database: Wide World Importers DW

Azure Dedicated Pool: wwi_dedicatedsql

Azure Synapse Workspace: synapsedev-iterations

Step 1: Check Roles to Access the ADLS Gen2 Storage Account

External tables can access two types of storage: public and protected. Here, we are setting up protected storage using a shared access signature credential along with an Azure AD identity.

Go to the storage account and select Access Control (IAM).

  • Assign the Storage Blob Data Contributor Azure role to the Azure Synapse Analytics server's managed identity on the ADLS Gen2 storage account.

We can assign this in the Azure portal by navigating to the Access Control (IAM, Identity and Access Management) menu of the storage account, or by using PowerShell. Please note that the role is assigned to the synapsedev-iterations App type (Azure Synapse).

  • Optional: If you are running CETAS as an automated process through a pipeline and trigger and you get a “Forbidden” access error, an ACL might be needed at the folder level, which can be set with Azure Storage Explorer. You can find more about ACLs in Azure Data Lake Storage Gen2 here.

Step 2: Create SAS Token for the Database Credential

By creating a Shared Access Signature, we are granting access to the resource for a specific period of time. 

In this scenario, we will create a long-lived SAS token so that we do not lose access after days, weeks, or months.

At a high level, this is how SAS tokens work:

  1. Your application submits the SAS token to Azure Storage as part of a REST API request.
  2. If the storage service verifies that the SAS is valid, the request is authorized.
  3. If the SAS token is deemed invalid, the request is declined and error code 403 (Forbidden) is returned.

After a brief review of how the SAS token functions, let’s create it:

  • Go to storage Account and select Shared Access Signature
  • On the Allowed resource types, select (Service, Container, and Object)
  • For the Start and expiry date/time, set up as follows: 

Please note: Start is your current date and End is set to the year 2099. The end date can also be set to the first day and first month of that year, 01/01/2099.

  1. Start: 10/16/2022
  2. End: 10/16/2099
  • Click Generate SAS and connection string; this is needed for the creation of the credential at the database level.
  • Save the SAS token, as it will be used later.

Step 3: Check if There Is Any Credential Created Previously in the Dedicated Pool

Using SSMS, run the SQL script below to verify whether any credentials already exist in the database.

If a master key already exists, it was most likely created when the database was set up by the developer or DBA.
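
A minimal sketch of these checks, run against the dedicated pool in SSMS (the blog's exact script is in cetas_part1.sql; the catalog views below are standard T-SQL):

    -- List any database scoped credentials that already exist
    SELECT name, credential_identity, create_date
    FROM sys.database_scoped_credentials;

    -- Check whether a database master key has already been created
    SELECT name, create_date
    FROM sys.symmetric_keys
    WHERE name = '##MS_DatabaseMasterKey##';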

Step 4: Create DB Master Key

Please note that the master key password must meet policy requirements.

  • Verify the presence of the key, as shown in the sketch below.
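
A minimal sketch, assuming no master key exists yet; the password below is a placeholder and must meet your policy requirements:

    -- Create the database master key (replace the placeholder password)
    CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<Str0ng!Placeholder-Passw0rd>';

    -- Verify the presence of the key
    SELECT name, create_date
    FROM sys.symmetric_keys
    WHERE name = '##MS_DatabaseMasterKey##';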

Step 5: Create DB-Scoped Credential

  • Here we are going to use the SAS token we generated in Step 2.
  • Please note that to use the SAS token as the SECRET, we need to remove the “?” symbol at the beginning of the string.
  • Example of the leading characters to trim: ?sv=2021-06-08&ss=
  • Verify the creation of the credential, as shown below.
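
A minimal sketch of the credential creation; the name DeltaLakeStorageCredential is an assumption for this walkthrough, and the SECRET is your SAS token from Step 2 with the leading “?” removed:

    -- Create a database scoped credential from the SAS token (no leading '?')
    CREATE DATABASE SCOPED CREDENTIAL DeltaLakeStorageCredential
    WITH IDENTITY = 'SHARED ACCESS SIGNATURE',
         SECRET = 'sv=2021-06-08&ss=...';  -- paste the full SAS token here

    -- Verify the creation of the credential
    SELECT name, credential_identity
    FROM sys.database_scoped_credentials;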

Step 6: Create an External Data Source

  • The external data source references the deltadev01 container in the jgstoragedatalake storage account, which is protected with a SAS key.
  • Save the ABFSS path needed for the external data source. This is the path under which the parquet files will be accessed. Please note that it can be found in the properties of the Azure Data Lake Gen2 storage under the Data blade in Azure Synapse.
  • LOCATION = ABFSS path
  • CREDENTIAL = the database-scoped credential from Step 5
  • Please note that the ABFSS path must end at .net; do not include the name of the folder.
  • Refresh the connection to the dedicated pool in SSMS; the external data source is now created.
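
A minimal sketch of the data source for this environment, assuming the credential name from Step 5; the LOCATION is the ABFSS path of the deltadev01 container and stops at .net, with no folder name:

    -- Create the external data source pointing at the container root
    CREATE EXTERNAL DATA SOURCE DeltaLakeStorageDS
    WITH (
        LOCATION   = 'abfss://deltadev01@jgstoragedatalake.dfs.core.windows.net',
        CREDENTIAL = DeltaLakeStorageCredential
    );

If you are creating Hadoop external tables in the dedicated pool rather than native ones, the data source would also need TYPE = HADOOP.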

Step 7: Create External File Format

  • Create the external file format; in this scenario we are using PARQUET (a sketch follows the table below).

The following table lists the supported data formats:

Data format (native external tables) | Serverless SQL pool | Dedicated SQL pool
Parquet | Yes (GA) | Yes (public preview)
CSV | Yes | No (alternatively, use Hadoop external tables)
Delta | Yes | No
Spark | Yes | No
Dataverse | Yes | No
Azure Cosmos DB data formats (JSON, BSON, etc.) | No (alternatively, create a view) | No
  • Please note that this table may have changed since this blog was written; review the Microsoft documentation.
  • Refresh the connection to the dedicated pool; the external file format is now created.
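
A minimal sketch of the file format; the name ParquetFormat matches the file format referenced later in this post, and the Snappy compression setting is optional:

    -- Create the Parquet external file format
    CREATE EXTERNAL FILE FORMAT ParquetFormat
    WITH (
        FORMAT_TYPE = PARQUET,
        DATA_COMPRESSION = 'org.apache.hadoop.io.compress.SnappyCodec'
    );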

Step 8: Verify the Source and Destination Mapping for CETAS

  • We are using the Customers and Orders tables from the Sales schema to create the external tables. First, we need to describe the structure of the source tables so we can write the SQL script with the appropriate columns and data types for the destination tables, as shown below.
  • Describe Sales.Customers
  • Describe Sales.Orders
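
The exact describe script is not shown here, so as an assumption, one way to inspect the source structure against the source Azure SQL Database is:

    -- Inspect column names and data types of the source tables
    EXEC sp_help 'Sales.Customers';
    EXEC sp_help 'Sales.Orders';

    -- Alternative: return the result-set metadata of a query
    EXEC sp_describe_first_result_set N'SELECT * FROM Sales.Customers';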

Create an External Table on a Single File

  • We are going to use Sales.Customers.parquet for the creation of the external table.
  • Please note that if you are not the creator of the source table and the value ranges for the data types are unknown, assume a range based on the context of each column and use generous maximum sizes for strings such as nvarchar. Data types can vary from database to database, and this will need further analysis of the table columns.
  • The LOCATION can be found at the properties level of the folder that contains the parquet file as Name property.
  • LOCATION = Name Property of parquet file
  • DATA_SOURCE = DeltaLakeStorageDS
  • FILE_FORMAT = ParquetFormat
  • Refresh the connection to the dedicated pool; we now see the external table “Customers_cetas” created.
  • Finally, select from the table “Customers_cetas”, as sketched below.
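
A minimal sketch of the single-file external table; the column list is illustrative only (match it to your parquet schema), and the LOCATION shown is an assumed Name property for the file, while DATA_SOURCE and FILE_FORMAT follow the bullets above:

    -- Illustrative columns: replace with the actual columns of your parquet file
    CREATE EXTERNAL TABLE dbo.Customers_cetas
    (
        CustomerID   INT,
        CustomerName NVARCHAR(100),
        PhoneNumber  NVARCHAR(20)
    )
    WITH (
        LOCATION    = 'bronze/Sales.Customers.parquet',  -- Name property of the parquet file (assumed path)
        DATA_SOURCE = DeltaLakeStorageDS,
        FILE_FORMAT = ParquetFormat
    );

    -- Finally, select from the table
    SELECT TOP 100 * FROM dbo.Customers_cetas;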

Create an External Table on a Set of Files

  • In this case, the method is the same; the only difference is using the asterisk symbol to read all files in the folder recursively. I will only show the final script.
  • LOCATION is the only property that changes in the script, with the use of (*).
  • Complete script (a sketch follows below)
  • Please note that in this example I am reading from a subfolder called Customers, which holds all the subset parquet files that make up the whole dataset for the table.
  • Also note that a dynamic folder pattern can be satisfied with (*), and this is required only for Parquet and CSV. The Delta Lake format only needs the root folder; the external table will automatically find the files.
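
A minimal sketch for the folder of files; the table name and columns are again illustrative, and the only change from the single-file example is the wildcard in LOCATION pointing at the Customers subfolder (an assumed path):

    -- Read every parquet file under the Customers subfolder
    CREATE EXTERNAL TABLE dbo.Customers_all_cetas
    (
        CustomerID   INT,
        CustomerName NVARCHAR(100),
        PhoneNumber  NVARCHAR(20)
    )
    WITH (
        LOCATION    = 'bronze/Customers/*.parquet',  -- (*) picks up all subset files in the folder
        DATA_SOURCE = DeltaLakeStorageDS,
        FILE_FORMAT = ParquetFormat
    );

    SELECT COUNT(*) AS RowsRead FROM dbo.Customers_all_cetas;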

Conclusion

CREATE EXTERNAL TABLE AS SELECT (CETAS) is a great resource when you need access to files in Azure Data Lake through the different layers of a data lakehouse architecture. It is essential in this type of architecture for managing data spread across different folders and sometimes stored in different formats such as Parquet, CSV, or Delta.

The level of security these projects require often calls for delegated access to the storage resource to be set up so that automated processes in Azure Synapse or Data Factory run smoothly and without errors.

Therefore, using SAS and Azure AD credentials is powerful and beneficial for the privacy of a company’s data. Furthermore, with CETAS the data analyst or data engineer can control and explore the data in the data lake for transformations, or present information in the visualization layer to Power BI and Tableau.

All in all, if you find yourself looking for ways to improve access to files in Azure Data Lake and you need to expose data to the end analyst, this is the way to do it.

Additional Resources

You can find out more about creating and using native external tables with SQL pools in Azure Synapse Analytics here

You can find out more about CETAS with Synapse SQL here

You can find out more about CREATE MASTER KEY here

You can find out more about using Azure Storage Explorer to manage ACLs in Azure Data Lake Storage Gen2 here

3 Comments

  • Hello, thanks for this interesting article!
    I have a problem that you might be able to help me solve:

    I’m trying to use the CETAS instruction with a parquet file as the source. This parquet file comes from a copy activity that has an Azure SQL database as the source and my storage Gen2 as the sink (output as a parquet file).
    When using CETAS without defining the column definitions, it works fine, but if I try to explicitly define the columns/types in the query and then use a SELECT *, I get the message: “Different number of columns in CREATE TABLE or CREATE EXTERNAL TABLE and SELECT query.”

    My parquet file has exactly the same number of columns as the column definitions in my query. Is there something I don’t know about parquet files, such as a “hidden” default column? I also tried to explicitly list the columns in the SELECT statement, but with the same result.
    Thanks for your help!

  • Great article, thank you for the insight. After I create an external table, how do I update it with new records that are loaded into the folder? I understand using the * to reference sets of data, but will SELECT * FROM external_table show new data loaded after the external table was created?

    • Hi Kelly, thank you for your question. You are right: after the external table is created, the “link” to the data in the folders exists, so yes, SELECT * FROM the external table will show all the data. Remember that with external tables you are basically creating a “link/pointer” to the files.


Jorge Garcia is an Analytics Consultant at Iteration Insights, a tech-enabled, data analytics solutions provider located in Calgary, Alberta. Jorge has a Bachelor's in Computer Systems Engineering. He has experience as a Database Administrator and is passionate about technology and data. Some certifications he holds are Microsoft Azure Data Fundamentals, Microsoft MTA Data Fundamentals, AWS Cloud Practitioner, LPIC Linux, CompTIA Linux+, and a Database Administration Certificate from the University of Calgary. In his personal time, Jorge likes to travel, practice boxing, read, play video games and watch movies.
