Understanding dataflow in Azure Data Factory

Getting a feel for mapping data flows in ADF

Data ingestion and migration from one storage system to another, along with transforming the ingested data before it reaches its final store, are becoming everyday tasks for data engineers. In most organizations, data engineers carry out these tasks to build large-scale data collection, storage, and analysis systems.

Azure Data Factory

Azure Data Factory is Azure’s cloud Extract, Transform and Load (ETL) service for scale-out, serverless data integration and data transformation. You can design data-driven workflows to orchestrate data movement and transform data at scale.

Data-driven workflows, also known as pipelines, can be planned and created to ingest data from various data stores and to transform it with data flows or compute services such as Azure HDInsight, Azure Databricks, Azure Synapse Analytics, and Azure SQL Database.

The purpose of Data Factory is to retrieve data from one or more data sources and convert it into a format you can process further.

Data Flows Mapping

Mapping data flows are visually designed data transformations in Azure Data Factory. Microsoft intends mapping data flows to be a zero-code solution for your ETL activities. The resulting data flows are executed on scaled-out Apache Spark clusters as activities within Azure Data Factory pipelines, so they can be operationalized using the existing Azure Data Factory scheduling, control flow, and monitoring capabilities.

Creating a Data Flow

Data flows are created from the Azure Data Factory Resources pane, just like pipelines and datasets. To create a data flow, click the Author tab so the Factory Resources pane appears, select the three dots next to Data flows, and click New data flow.

Add Data Source

  • Click Add Source on the graph pane.
  • Change the Output stream name to whatever you prefer.
  • For the dataset, click the plus (+) icon and add the dataset you want to perform your mapping on. For this article, I will be using Azure Data Lake Storage Gen2 (a sketch of the generated source script appears after this list).
  • Click on this link to see how to ingest data into Azure Data Lake.
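
Behind the visual designer, every data flow is also expressed as data flow script. As a rough illustration only, a source reading customer records from the Gen2 storage might be scripted along these lines (the stream name, columns, and settings here are hypothetical placeholders, not output copied from ADF):

    source(output(
            customerId as integer,
            customerName as string,
            country as string
        ),
        allowSchemaDrift: true,
        validateSchema: false) ~> CustomerSource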

Partitioning

There are partitioning scheme configuration options on the Optimize tab. For additional information on how to optimize your data flows, see the Microsoft docs.

Activate Data Flow Debug

Azure Data Factory data flow’s debug mode allows you to interactively watch the data shape transform while you build and debug your data flows. The debug session can be used both while designing data flows and while debugging data flow pipeline executions. Turn on Data flow debug at the top of your canvas to activate it. You can also set the debug time to live to more than 1 hour.

Data Flow Mapping Actions

Azure Data Flow mapping supports the transformations listed below. We are going to go through each of them and explain its functionality in this article.

JOIN

This is the first transformation; click the plus (+) sign in the graph to add transformations.

The join transformation is used to combine data from two sources or streams in a mapping data flow. The output stream includes all columns from both sources, matched based on a join condition.

The join types available in data flow are inner, left outer, right outer, full outer, and custom (cross) joins; see the Microsoft docs for details.
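
As a minimal sketch, assuming two hypothetical streams named Orders and Customers joined on a customerId column, an inner join might appear in data flow script roughly as:

    Orders, Customers join(
        Orders.customerId == Customers.customerId,
        joinType: 'inner',
        broadcast: 'auto') ~> JoinOrdersToCustomers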

CONDITIONAL SPLIT

The conditional split transformation routes data rows to different streams based on matching conditions. The split transformation is similar to a CASE decision structure in a programming language. See the Microsoft docs for more details.
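
A minimal sketch, assuming a hypothetical orderValue column, that routes rows into three output streams (the last stream catches everything that matches no condition):

    Orders split(
        orderValue > 1000,
        orderValue > 100,
        disjoint: false) ~> SplitByValue@(largeOrders, mediumOrders, smallOrders)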

EXISTS

The exists transformation determines whether your data is present in another source or stream via a row filtering operation. The output stream includes all rows from the left stream that either exist in or do not exist in the right stream, depending on the setting. See the Microsoft docs.
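
For illustration, assuming hypothetical NewRecords and ExistingRecords streams keyed on id, here is a sketch that keeps only the left-stream rows absent from the right stream (negate: true flips the check to "does not exist"):

    NewRecords, ExistingRecords exists(
        NewRecords.id == ExistingRecords.id,
        negate: true,
        broadcast: 'auto') ~> OnlyNewRows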

UNION

The union transformation in Azure Data Factory behaves like the UNION clause in SQL: it combines data from two or more streams that have the same or a compatible schema into a single stream. See the Microsoft docs.
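
A one-line sketch, assuming hypothetical OnlineSales and StoreSales streams with matching schemas; byName: true combines columns by name rather than by position:

    OnlineSales, StoreSales union(byName: true) ~> AllSales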

LOOKUP

The lookup transformation references data from another source or stream in your data flow. Columns from the matched data are appended to your source data as a result of the lookup. See the Microsoft docs.
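
As a hedged sketch with hypothetical Orders and CountryCodes streams, a lookup that appends the country columns for each matching code might look roughly like this:

    Orders, CountryCodes lookup(
        Orders.countryCode == CountryCodes.code,
        multiple: false,
        pickup: 'any',
        broadcast: 'auto') ~> LookupCountryName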

DERIVED COLUMN

You can create new columns in your data flow or alter existing ones using the derived column transformation. A data stream frequently requires new computed fields or updates to data in existing fields. See the Microsoft docs.
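
For example, assuming hypothetical firstName, lastName, and signupDate columns, a derived column step could compute a full name and a signup year (concat and year are data flow expression functions):

    Customers derive(
        fullName = concat(firstName, ' ', lastName),
        signupYear = year(signupDate)) ~> AddDerivedColumns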

SELECT

The select transformation is used to rename, drop, or reorder columns. This transformation chooses which columns are sent downstream rather than changing the row data itself. See the Microsoft docs.
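
A sketch with hypothetical columns, keeping customerId and country and renaming customerName to name:

    Customers select(mapColumn(
            customerId,
            name = customerName,
            country
        ),
        skipDuplicateMapInputs: true,
        skipDuplicateMapOutputs: true) ~> SelectAndRename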

AGGREGATE

The aggregate transformation defines aggregations of columns in your data streams, much like rolling up data with a GROUP BY clause in SQL. See the Microsoft docs.

The following is the list of aggregate functions in Data Flow:

approxDistinctCount, avg, avgIf, collect, collectUnique, count, countAll, countDistinct, countAllDistinct, countIf, covariancePopulation, covariancePopulationIf, covarianceSample, covarianceSampleIf, first, isDistinct, kurtosis, kurtosisIf, last, max, maxIf, mean, meanIf, min, minIf, skewness, skewnessIf, stddev, stddevIf, stddevPopulation, stddevPopulationIf, stddevSample, stddevSampleIf, sum, sumDistinct, sumDistinctIf, sumIf, topN, variance, varianceIf, variancePopulation, variancePopulationIf, varianceSample, varianceSampleIf
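
As a short sketch with hypothetical country and revenue columns, grouping by country and computing two aggregates:

    Sales aggregate(groupBy(country),
        totalRevenue = sum(revenue),
        orderCount = count()) ~> AggregateByCountry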

SURROGATE KEY

The surrogate key transformation is used to add an incrementing key value to each row of data. This is helpful for creating dimension tables in a star-schema analytical data model. See the Microsoft docs.
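
A sketch that adds a hypothetical customerKey column starting at 1:

    DimCustomers keyGenerate(output(customerKey as long),
        startAt: 1L) ~> AddSurrogateKey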

PIVOT

Use the pivot transformation to turn distinct row values of a single column into multiple columns. Pivot is an aggregation: you choose group-by columns and generate the pivoted columns using aggregate functions. See the Microsoft docs.
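
A rough, unverified sketch only, assuming hypothetical team and position columns; the exact option names are best checked against the Microsoft docs:

    PlayerStats pivot(groupBy(team),
        pivotBy(position),
        playerCount = count(),
        columnNaming: '$N$V',
        lateral: true) ~> PivotByPosition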

UNPIVOT

Use the unpivot transformation in a mapping data flow to expand values from several columns in a single record into multiple records with the same values in a single column. This is a way to normalize an unnormalized dataset. See the Microsoft docs.
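
Again only a rough sketch with hypothetical columns, turning per-quarter columns back into quarter/salesAmount rows; verify the option names against the Microsoft docs:

    QuarterlySales unpivot(output(
            quarter as string,
            salesAmount as integer
        ),
        ungroupBy(productId),
        lateral: true,
        ignoreNullPivots: false) ~> UnpivotQuarters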

WINDOW

The window transformation is where you define window-based aggregations of columns in your data streams. Using the Expression Builder, you can define aggregations based on data or time windows (the SQL OVER clause), such as LEAD, LAG, NTILE, CUMEDIST, RANK, and others. See the Microsoft docs.
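
A sketch assuming hypothetical department and salary columns: partition by department, order by salary descending, then add a rank plus the previous row’s salary:

    Salaries window(over(department),
        desc(salary, true),
        salaryRank = rank(),
        previousSalary = lag(salary, 1)) ~> RankSalaries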

FILTER

The filter transformation allows row filtering based on a condition. The output stream includes all rows that match the filtering condition. The filter transformation is comparable to a WHERE clause in SQL. See the Microsoft docs.
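
For example, with hypothetical orderYear and customerId columns:

    Orders filter(orderYear > 2020 && !isNull(customerId)) ~> RecentOrders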

SORT

You can sort the incoming rows on the current data stream using the sort transformation. Choose the columns to sort by and whether to sort them in ascending or descending order. See the Microsoft docs.
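
A one-line sketch sorting a hypothetical orderDate column in descending order:

    Orders sort(desc(orderDate, true)) ~> SortByNewest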

ALTER ROW

Use the alter row transformation to set insert, delete, update, and upsert policies on rows. Because each row is tagged with the policy of the first matching expression, these conditions should be listed in order of precedence. Each condition can result in a row (or rows) being inserted, updated, deleted, or upserted. See the Microsoft docs.
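
A sketch assuming a hypothetical changeType column that drives the row policy:

    Changes alterRow(
        insertIf(changeType == 'I'),
        updateIf(changeType == 'U'),
        deleteIf(changeType == 'D'),
        upsertIf(isNull(changeType))) ~> SetRowPolicies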

Azure Data Factory Data Flow is still evolving, and more transformations will continue to be added over time.

Conclusion

In this article, we covered the different aspects of Data Flow in Azure Data Factory: creating a data flow for mapping transformations, the capabilities data flows provide, and when to use particular transformations to get your desired shape out of a dataset.
