Interview questions for Azure
Azure Data Factory is a cloud-based Microsoft tool that collects raw business data and transforms it into usable information. There is considerable demand for Azure Data Factory engineers in the industry, so cracking the interview takes some homework. This Azure Data Factory Interview Questions blog contains the questions most likely to be asked during Data Engineer job interviews.
1. Why do we need Azure Data Factory?
The amount of data generated these days is huge, and it comes from many different sources. When we move this data to the cloud, a few things need to be taken care of.
• Data can arrive in any form, because different sources transfer or channel it in different ways and in different formats. When we bring this data to the cloud or to a particular store, we need to make sure it is well managed: we transform the data and delete the unnecessary parts. As far as moving the data is concerned, we need to pick it up from the different sources, bring it to one common place, store it, and, if required, transform it into something more meaningful.
• A traditional data warehouse can also do this, but it has certain disadvantages. Sometimes we are forced to build custom applications that handle each of these processes individually, which is time-consuming, and integrating all these sources is a huge pain. We need a way to automate this process or create proper workflows.
• Data Factory helps orchestrate this complete process in a more manageable and organized manner.
2. What is Azure Data Factory?
Azure Data Factory is a cloud-based integration service that allows you to create data-driven workflows in the cloud for orchestrating and automating data movement and data transformation.
• Using Azure data factory, you can create and schedule the data-driven workflows(called pipelines) that can ingest data from disparate data stores.
• It can process and transform the data by using compute services such as HDInsight Hadoop, Spark, and Azure Data Lake Analytics.
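Pipelines in Data Factory are defined as JSON. As a rough, hedged sketch (the pipeline, dataset, and activity names below are hypothetical), a pipeline with a single Copy activity that ingests data from a blob store into a SQL table might look like:

```json
{
  "name": "CopyBlobToSqlPipeline",
  "properties": {
    "activities": [
      {
        "name": "CopyFromBlobToSql",
        "type": "Copy",
        "inputs": [
          { "referenceName": "InputBlobDataset", "type": "DatasetReference" }
        ],
        "outputs": [
          { "referenceName": "OutputSqlDataset", "type": "DatasetReference" }
        ],
        "typeProperties": {
          "source": { "type": "BlobSource" },
          "sink": { "type": "SqlSink" }
        }
      }
    ]
  }
}
```

The referenced datasets and their linked services would be defined separately; the pipeline itself only orchestrates the activities.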
What do we understand by Integration Runtime?
The integration runtime is the compute infrastructure used by Azure Data Factory. It provides integration capabilities across various network environments.
A quick look at the Types of Integration Runtimes:
• Azure Integration Runtime – Can copy data between cloud data stores and dispatch activities to various compute services, such as SQL Server, Azure HDInsight, etc.
• Self-Hosted Integration Runtime – It’s essentially software with the same code as the Azure Integration Runtime, but it’s installed on an on-premises machine or on a virtual machine inside a virtual network.
• Azure-SSIS Integration Runtime – It allows you to run SSIS packages in a managed environment. So when we lift and shift SSIS packages to Data Factory, we use the Azure-SSIS Integration Runtime.
3. What is the difference between Azure Data Lake and Azure Data Warehouse?
| Azure Data Lake | Azure Data Warehouse |
| --- | --- |
| Capable of storing data of any form, size, or shape. | A store for data that has already been filtered from a specific source. |
| Used mostly by data scientists. | Used mostly by business professionals. |
| Easily accessible and accepts frequent changes. | Changing the data warehouse is a strict and costly task. |
| The schema is defined after the data is stored (schema-on-read). | The schema is defined before the data is stored (schema-on-write). |
| Employs the ELT (Extract, Load, and Transform) approach. | Employs the ETL (Extract, Transform, and Load) approach. |
| An excellent tool for conducting in-depth analysis. | The better platform for operational users. |
4. What is Azure SSIS Integration Runtime?
The Azure-SSIS Integration Runtime is a fully managed cluster of virtual machines hosted in Azure and dedicated to running SSIS packages in your data factory. You can scale up SSIS nodes by configuring the node size, or scale out by configuring the number of nodes in the virtual machine cluster.
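As a hedged sketch, the scale-up and scale-out settings live in the integration runtime's JSON definition; the runtime name, region, and VM size below are placeholder values:

```json
{
  "name": "MySsisIntegrationRuntime",
  "properties": {
    "type": "Managed",
    "typeProperties": {
      "computeProperties": {
        "location": "EastUS",
        "nodeSize": "Standard_D4_v3",
        "numberOfNodes": 2,
        "maxParallelExecutionsPerNode": 4
      }
    }
  }
}
```

Increasing `nodeSize` scales up each node, while increasing `numberOfNodes` scales out the cluster.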
5. What is required to execute an SSIS package in Data Factory?
You need to create an Azure-SSIS integration runtime and an SSIS catalog (SSISDB) hosted in an Azure SQL Database or an Azure SQL Managed Instance.
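Once the runtime and catalog exist, the package is run from a pipeline with an Execute SSIS Package activity. A hedged sketch (the activity name, package path, and runtime name are hypothetical):

```json
{
  "name": "RunDailyLoadPackage",
  "type": "ExecuteSSISPackage",
  "typeProperties": {
    "packageLocation": {
      "type": "SSISDB",
      "packagePath": "MyFolder/MyProject/DailyLoad.dtsx"
    },
    "connectVia": {
      "referenceName": "MySsisIntegrationRuntime",
      "type": "IntegrationRuntimeReference"
    }
  }
}
```

The `connectVia` reference binds the activity to the Azure-SSIS integration runtime that will execute the package.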
6. What is the integration runtime?
• The integration runtime is the compute infrastructure that Azure Data Factory uses to provide data integration capabilities across different network environments.
The 3 types of integration runtimes:
o Azure Integration Runtime: The Azure Integration Runtime can copy data between cloud data stores, and it can dispatch activities to a variety of compute services, such as Azure HDInsight or SQL Server, where the transformation takes place.
o Self-Hosted Integration Runtime: The Self-Hosted Integration Runtime is software with essentially the same code as the Azure Integration Runtime, but you install it on an on-premises machine or on a virtual machine in a virtual network. A self-hosted IR can run copy activities between a public cloud data store and a data store in a private network, and it can dispatch transformation activities against compute resources in a private network. We use a self-hosted IR because Data Factory cannot directly access on-premises data sources, as they sit behind a firewall. It is sometimes possible to establish a direct connection between Azure and an on-premises data source by configuring the firewall in a specific way; if we do that, we don’t need a self-hosted IR.
o Azure-SSIS Integration Runtime: With the Azure-SSIS Integration Runtime, you can natively execute SSIS packages in a managed environment. So when we lift and shift SSIS packages to Data Factory, we use the Azure-SSIS Integration Runtime.
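A data store behind a firewall is reached through the self-hosted IR by pointing the linked service's `connectVia` property at it. A hedged sketch, with hypothetical server, database, and runtime names:

```json
{
  "name": "OnPremSqlServerLinkedService",
  "properties": {
    "type": "SqlServer",
    "typeProperties": {
      "connectionString": "Server=myserver;Database=mydb;Integrated Security=True;"
    },
    "connectVia": {
      "referenceName": "MySelfHostedIR",
      "type": "IntegrationRuntimeReference"
    }
  }
}
```

Without the `connectVia` reference, Data Factory would try to reach the server over the public network and fail against the firewall.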
7. What is the limit on the number of integration runtimes?
There is no hard limit on the number of integration runtime instances you can have in a data factory. There is, however, a limit on the number of VM cores that the integration runtime can use per subscription for SSIS package execution.
8. What are the top-level concepts of Azure Data Factory?
• Pipeline: A pipeline acts as a carrier for the various processes taking place; an individual process is an activity.
• Activities: Activities represent the processing steps in a pipeline. A pipeline can have one or more activities, and an activity can be any process, such as querying a dataset or moving a dataset from one source to another.
• Datasets: Datasets are the sources of data. In simple words, a dataset is a data structure that holds, or points to, our data.
• Linked services: These store the information needed to connect to an external source.
For example, consider SQL Server: you need a connection string to connect to the external data source, and you need to mention the source and the destination of your data.
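To make the relationship between these concepts concrete, here is a hedged sketch of a linked service holding the connection information and a dataset that points at one table through it (the names, server, and table are hypothetical, and the password is a placeholder):

```json
{
  "name": "AzureSqlLinkedService",
  "properties": {
    "type": "AzureSqlDatabase",
    "typeProperties": {
      "connectionString": "Server=tcp:myserver.database.windows.net;Database=mydb;User ID=myuser;Password=***;"
    }
  }
}
```

```json
{
  "name": "CustomerTableDataset",
  "properties": {
    "type": "AzureSqlTable",
    "linkedServiceName": {
      "referenceName": "AzureSqlLinkedService",
      "type": "LinkedServiceReference"
    },
    "typeProperties": {
      "tableName": "dbo.Customers"
    }
  }
}
```

Activities in a pipeline then reference the dataset, so the connection details live in exactly one place: the linked service.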