Interview Questions for Azure Data Factory
1. Why do we need Azure Data Factory?
Huge amounts of data are generated today, and this data comes from many different sources. When we move this data to the cloud, a few things need to be taken care of.
• Data can arrive in any form: each source transfers or channels its data in its own way and in its own format. When we bring this data to the cloud or to a particular store, we need to make sure it is well managed, i.e. we need to transform it and delete the unnecessary parts. As far as moving the data is concerned, we need to pick it up from the different sources, bring it to one common place, store it, and, if required, transform it into something more meaningful.
• A traditional data warehouse can do this as well, but it has certain disadvantages. We are sometimes forced to build custom applications that handle each of these processes individually, which is time-consuming, and integrating all the sources is a huge pain. We need a way to automate this process or to create proper workflows.
• Data Factory helps orchestrate this complete process in a more manageable and organized manner.
2. Briefly describe the purpose of the ADF service.
• ADF is used mainly to orchestrate the copying of data between different relational and non-relational data sources, hosted in the cloud or on-premises in your own datacenters. ADF can also be used to transform the ingested data to meet your business requirements. It serves as the ETL or ELT tool for data ingestion in most Big Data solutions.
3. What is Azure Data Factory?
Azure Data Factory is a cloud-based data integration service that lets you create data-driven workflows in the cloud for orchestrating and automating data movement and data transformation.
• Using Azure Data Factory, you can create and schedule data-driven workflows (called pipelines) that ingest data from disparate data stores.
• It can process and transform the data by using compute services such as HDInsight Hadoop, Spark, Azure Data Lake Analytics, and Azure Machine Learning.
Data Factory consists of a number of components. Mention these components briefly.
• Pipeline: A logical container for activities
• Activity: An execution step in the Data Factory pipeline that can be used for data ingestion and transformation
• Mapping Data Flow: Data transformation logic that is designed visually in the UI
• Dataset: A pointer to the data used in the pipeline activities
• Linked Service: A description of the connection (much like a connection string) to the data sources used in the pipeline activities
• Trigger: Specifies when the pipeline will be executed
• Control flow: Controls the execution flow of the pipeline activities
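To make the components above more concrete, here is a minimal sketch that models a pipeline, two activities, a control-flow dependency, and a trigger as plain Python dictionaries. The layout loosely follows the JSON that Data Factory uses for authoring, but it is simplified (real definitions carry more properties), and every name, such as CopySalesPipeline and SalesSqlTable, is hypothetical. The execution_order helper is not part of any ADF SDK; it only illustrates how dependsOn determines the order of activities.

```python
# Pipeline: a logical container for activities (all names are hypothetical).
pipeline = {
    "name": "CopySalesPipeline",
    "properties": {
        "activities": [
            {
                # Activity: one execution step, here a copy between two datasets.
                "name": "CopySalesData",
                "type": "Copy",
                "inputs": [{"referenceName": "SalesSqlTable", "type": "DatasetReference"}],
                "outputs": [{"referenceName": "SalesBlobFile", "type": "DatasetReference"}],
            },
            {
                # Control flow: this step runs only after the copy has succeeded.
                "name": "TransformSalesData",
                "type": "ExecuteDataFlow",
                "dependsOn": [
                    {"activity": "CopySalesData", "dependencyConditions": ["Succeeded"]}
                ],
            },
        ]
    },
}

# Trigger: specifies when the pipeline runs (a daily schedule in this sketch).
trigger = {
    "name": "DailyTrigger",
    "properties": {
        "type": "ScheduleTrigger",
        "typeProperties": {"recurrence": {"frequency": "Day", "interval": 1}},
        "pipelines": [
            {
                "pipelineReference": {
                    "referenceName": "CopySalesPipeline",
                    "type": "PipelineReference",
                }
            }
        ],
    },
}


def execution_order(pipeline_def):
    """Illustrative helper: order activity names so each follows its dependencies."""
    activities = pipeline_def["properties"]["activities"]
    pending = {
        a["name"]: {d["activity"] for d in a.get("dependsOn", [])} for a in activities
    }
    ordered = []
    while pending:
        # An activity is ready once all the activities it depends on have been placed.
        ready = [name for name, deps in pending.items() if deps <= set(ordered)]
        if not ready:
            raise ValueError("circular dependency between activities")
        for name in ready:
            ordered.append(name)
            del pending[name]
    return ordered


print(execution_order(pipeline))
```

The datasets referenced by the copy activity would, in turn, point at linked services that hold the connection details.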
4. What is the integration runtime?
• The integration runtime (IR) is the compute infrastructure that Azure Data Factory uses to provide data integration capabilities, such as data movement, activity dispatch, and SSIS package execution, across different network environments.
• There are three types of integration runtime:
• Azure Integration Runtime: The Azure Integration Runtime can copy data between cloud data stores, and it can dispatch activities to a variety of compute services, such as Azure HDInsight or SQL Server, where the transformation takes place.
• Self-Hosted Integration Runtime: The Self-Hosted Integration Runtime is software with essentially the same code as the Azure Integration Runtime, but you install it on an on-premises machine or on a virtual machine in a virtual network. A self-hosted IR can run copy activities between a public cloud data store and a data store in a private network. It can also dispatch transformation activities against compute resources in a private network. We use a self-hosted IR because Data Factory cannot directly access on-premises data sources that sit behind a firewall. It is sometimes possible to establish a direct connection between Azure and on-premises data sources by configuring the firewall in a specific way; in that case, we do not need a self-hosted IR.
• Azure-SSIS Integration Runtime: With the Azure-SSIS Integration Runtime, you can natively execute SSIS packages in a managed environment. So when we lift and shift SSIS packages to Data Factory, we use the Azure-SSIS Integration Runtime.
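The choice between the three runtime types can be summarized as a rule of thumb. The sketch below encodes only the reasoning given in this answer; the function and its parameters are hypothetical and are not part of Data Factory or any SDK.

```python
from enum import Enum


class IntegrationRuntime(Enum):
    AZURE = "Azure"
    SELF_HOSTED = "Self-hosted"
    AZURE_SSIS = "Azure-SSIS"


def choose_integration_runtime(runs_ssis_packages: bool,
                               touches_private_network: bool) -> IntegrationRuntime:
    """Rule-of-thumb selection based on the descriptions in this answer."""
    if runs_ssis_packages:
        # Lift-and-shift of existing SSIS packages into a managed environment.
        return IntegrationRuntime.AZURE_SSIS
    if touches_private_network:
        # On-premises or virtual-network data stores behind a firewall.
        return IntegrationRuntime.SELF_HOSTED
    # Copying between cloud data stores and dispatching to cloud compute.
    return IntegrationRuntime.AZURE


print(choose_integration_runtime(False, True).value)
```

Real projects often mix runtimes, for example a self-hosted IR for the on-premises source and the Azure IR for cloud-to-cloud steps.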
5. What is the difference between the Dataset and Linked Service in Data Factory?
• A linked service describes the connection information, much like a connection string, that is used to connect to a data store. For example, when ingesting data from a SQL Server instance, the linked service contains the name of the SQL Server instance and the credentials used to connect to that instance.
• A dataset is a reference to the data inside the data store that the linked service describes. When ingesting data from a SQL Server instance, the dataset points to the name of the table that contains the target data, or to the query that returns data from different tables.
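The relationship can be sketched with two plain Python dictionaries shaped roughly like Data Factory's JSON definitions. This is a simplified illustration: the names OnPremSqlServer, CustomerTable, and dbo.Customers are hypothetical, the connection string is a placeholder, and the describe helper is not an ADF API. The point is that the dataset holds only a reference to the linked service, while the linked service holds the connection details.

```python
# Linked service: how to connect (placeholder values, hypothetical name).
linked_service = {
    "name": "OnPremSqlServer",
    "properties": {
        "type": "SqlServer",
        "typeProperties": {
            "connectionString": "Server=<server>;Database=<database>;User ID=<user>;Password=<secret>"
        },
    },
}

# Dataset: which data to use, reached through the linked service it references.
dataset = {
    "name": "CustomerTable",
    "properties": {
        "type": "SqlServerTable",
        "linkedServiceName": {
            "referenceName": "OnPremSqlServer",
            "type": "LinkedServiceReference",
        },
        "typeProperties": {"tableName": "dbo.Customers"},
    },
}


def describe(dataset_def, linked_services):
    """Illustrative helper: resolve the dataset's linked service and summarize both."""
    ref = dataset_def["properties"]["linkedServiceName"]["referenceName"]
    service = linked_services[ref]
    table = dataset_def["properties"]["typeProperties"]["tableName"]
    return f"{table} via {service['name']} ({service['properties']['type']})"


print(describe(dataset, {linked_service["name"]: linked_service}))
```

Several datasets (one per table or query) can share the same linked service, which is why the two are defined separately.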
6. What is the limit on the number of integration runtimes?
There is no hard limit on the number of integration runtime instances you can have in a data factory. There is, however, a limit on the number of VM cores that the integration runtime can use per subscription for SSIS package execution.
7. What is the difference between Azure Data Lake and Azure Data Warehouse?
A data warehouse is the traditional way of storing data and is still widely used. A data lake is complementary to a data warehouse, i.e. data that sits in a data lake can also be stored in a data warehouse, but certain rules need to be followed.
| DATA LAKE | DATA WAREHOUSE |
| --- | --- |
| Complementary to the data warehouse | May be sourced from the data lake |
| Data is detailed or raw. It can be in any form; you just take the data and dump it into the data lake | Data is filtered, summarised, and refined |
| Schema on read (not structured; you can define your schema in any number of ways) | Schema on write (data is written in structured form or in a particular schema) |
| One language to process data of any format (U-SQL) | It uses SQL |
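The difference between schema on read and schema on write can be illustrated with a small, self-contained sketch. It uses in-memory lists in place of a real data lake or warehouse, and the record layout, WAREHOUSE_SCHEMA, and both functions are assumptions made purely for illustration.

```python
import json

# "Data lake": raw records are stored as-is, whatever their shape.
raw_lake_records = [
    '{"id": 1, "amount": "19.99", "note": "first order"}',
    '{"id": 2, "amount": "5.00"}',
    "not even json",
]

# "Data warehouse": the schema is fixed before any row is stored.
WAREHOUSE_SCHEMA = {"id": int, "amount": float}


def write_to_warehouse(table, row):
    """Schema on write: reject rows that do not match the schema before storing them."""
    if set(row) != set(WAREHOUSE_SCHEMA):
        raise ValueError(f"unexpected columns: {sorted(row)}")
    for column, expected_type in WAREHOUSE_SCHEMA.items():
        if not isinstance(row[column], expected_type):
            raise TypeError(f"{column} must be {expected_type.__name__}")
    table.append(row)


def read_from_lake(records):
    """Schema on read: everything was stored; a schema is applied only when reading."""
    rows = []
    for line in records:
        try:
            record = json.loads(line)
            rows.append({"id": int(record["id"]), "amount": float(record["amount"])})
        except (ValueError, KeyError, TypeError):
            continue  # this reader skips records that do not fit its schema
    return rows


warehouse_table = []
for row in read_from_lake(raw_lake_records):
    write_to_warehouse(warehouse_table, row)
print(warehouse_table)
```

A different reader could apply a different schema to the same raw records, for example one that also keeps the note field, which is what "define your schema in any number of ways" refers to.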
8. What is blob storage in Azure?
Azure Blob Storage is a service for storing large amounts of unstructured object data, such as text or binary data. You can use Blob Storage to expose data publicly to the world or to store application data privately. Common uses of Blob Storage include:
• Serving images or documents directly to a browser
• Storing files for distributed access
• Streaming video and audio
• Storing data for backup and restore, disaster recovery, and archiving
• Storing data for analysis by an on-premises or Azure-hosted service
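As a small illustration of serving a blob directly to a browser, the sketch below builds the URL of a blob on the default public endpoint, which has the form https://<account>.blob.core.windows.net/<container>/<blob>. The account, container, and blob names are hypothetical, and whether the URL is readable without credentials depends on the access settings of the storage account and container.

```python
from urllib.parse import quote


def blob_url(account: str, container: str, blob_name: str) -> str:
    """Build the default-endpoint URL for a blob (all names here are hypothetical)."""
    # quote() percent-encodes characters such as spaces but keeps "/" so that
    # virtual folder paths inside the blob name stay readable.
    return f"https://{account}.blob.core.windows.net/{container}/{quote(blob_name)}"


print(blob_url("mystorageaccount", "public-assets", "images/product photo.png"))
```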