### Interview Questions for Data Science

In this Data Science Interview Questions blog, I will introduce you to the most frequently asked questions in Data Science, Analytics, and Machine Learning interviews. This blog is the perfect guide to learn all the concepts required to clear a Data Science interview.

1. What do you understand by the term Data Science?

• Data science is a multidisciplinary field that combines statistics, data analysis, machine learning, mathematics, computer science, and related methods to understand data and solve complex problems.
• Data science is the deep study of massive amounts of data, finding useful information in raw, structured, and unstructured data.
• Data science is related to data mining and big data techniques, which deal with huge amounts of data and extract insights from it.
• It uses various tools, powerful programming, scientific methods, and algorithms to solve data-related problems.

2. Discuss Linear Regression.

•  Linear Regression is a popular machine learning algorithm based on supervised learning, used to understand the relationship between input and output numerical variables.
•  It applies regression analysis, a predictive modeling technique that finds a relationship between the dependent and independent variables.
•  It models a linear relationship between the independent and dependent variables, hence the name linear regression.
•  Linear Regression is used to predict continuous numerical variables such as sales per day, temperature, etc.
•  It can be divided into two categories:
a. Simple Linear Regression (one independent variable): y = b0 + b1x
b. Multiple Linear Regression (two or more independent variables)
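As a minimal sketch, simple linear regression can be fit with scikit-learn. The temperature/sales numbers below are made up purely for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data: daily temperature (input) vs. sales (output)
X = np.array([[20], [22], [25], [28], [30], [33]])  # independent variable
y = np.array([120, 135, 160, 190, 205, 230])        # dependent variable

model = LinearRegression().fit(X, y)
print("slope (b1):", model.coef_[0])
print("intercept (b0):", model.intercept_)
print("prediction for 26:", model.predict([[26]])[0])
```

The fitted `coef_` and `intercept_` correspond to b1 and b0 in the equation above.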

3. What is Selection Bias?

Selection bias is a kind of error that occurs when the researcher decides who is going to be studied. It is usually associated with research where the selection of participants isn’t random. It is sometimes referred to as the selection effect. It is the distortion of statistical analysis, resulting from the method of collecting samples. If the selection bias is not taken into account, then some conclusions of the study may not be accurate.
The types of selection bias include:
1. Sampling bias: It is a systematic error due to a non-random sample of a population causing some members of the population to be less likely to be included than others resulting in a biased sample.
2. Time interval: A trial may be terminated early at an extreme value (often for ethical reasons), but the extreme value is likely to be reached by the variable with the largest variance, even if all variables have a similar mean.
3. Data: When specific subsets of data are chosen to support a conclusion, or "bad" data are rejected on arbitrary grounds rather than according to previously stated or generally agreed criteria.
4. Attrition: Attrition bias is a kind of selection bias caused by attrition (loss of participants), i.e., discounting trial subjects or tests that did not run to completion.
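A small simulation can make sampling bias concrete. The population and the cut-off below are hypothetical; the point is that a non-random selection rule systematically shifts the sample statistic away from the population value:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical population, e.g. heights in cm
population = rng.normal(loc=170, scale=10, size=100_000)

# Random sample: its mean is close to the population mean
random_sample = rng.choice(population, size=1_000, replace=False)

# Biased sample: only individuals taller than 175 cm can be selected
biased_sample = rng.choice(population[population > 175], size=1_000, replace=False)

print("population mean:", population.mean())
print("random sample mean:", random_sample.mean())
print("biased sample mean:", biased_sample.mean())  # systematically too high
```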


4. Define Naive Bayes.

Naive Bayes is a popular classification algorithm used for predictive modeling. It is a supervised machine learning algorithm based on Bayes' theorem.
It is easy to build a Naive Bayes model, even when working with a large dataset. The name comprises two words, Naive and Bayes, where "naive" refers to the assumption that features are unrelated to each other.
In simple words, we can say that "a Naive Bayes classifier assumes that the features used for classification are statistically independent of each other, given the class."
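As a sketch, a Gaussian Naive Bayes classifier can be trained in a few lines with scikit-learn, here on the bundled Iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Iris: 4 numeric features, 3 classes
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# GaussianNB treats each feature as independent given the class
clf = GaussianNB().fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```

Despite the strong independence assumption, Naive Bayes is often a solid baseline classifier.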

5. What do you understand by the term Normal Distribution?

Data can be distributed in different ways: skewed to the left, skewed to the right, or jumbled up with no clear pattern.
However, data can also be distributed around a central value with no bias to the left or right; such data follows a normal distribution and forms a bell-shaped curve.
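A quick way to see the bell-curve behaviour is to draw samples from a normal distribution and check the well-known 68-95-99.7 rule (a minimal sketch using NumPy's random generator):

```python
import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(loc=0.0, scale=1.0, size=100_000)

# For a normal distribution, roughly 68% of values fall within
# 1 standard deviation of the mean, and roughly 95% within 2.
within_1sd = np.mean(np.abs(data) < 1)
within_2sd = np.mean(np.abs(data) < 2)
print(f"within 1 sd: {within_1sd:.3f}")  # close to 0.683
print(f"within 2 sd: {within_2sd:.3f}")  # close to 0.954
```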

6. What is correlation and covariance in statistics?

Covariance and correlation are two mathematical concepts widely used in statistics. Both establish the relationship and measure the dependency between two random variables. Although they are related, they differ: covariance measures how two variables vary together and is expressed in the product of their units, while correlation is a normalized, unit-free measure of the strength and direction of the linear relationship, always lying between -1 and 1.
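The difference shows up clearly in code: rescaling one variable changes the covariance but leaves the correlation untouched (toy numbers chosen for illustration):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 6.0, 8.2, 10.1])

cov = np.cov(x, y)[0, 1]        # depends on the units of x and y
corr = np.corrcoef(x, y)[0, 1]  # unit-free, always in [-1, 1]

# Rescaling x scales the covariance but not the correlation
cov_scaled = np.cov(100 * x, y)[0, 1]
corr_scaled = np.corrcoef(100 * x, y)[0, 1]
print("cov:", cov, "-> scaled:", cov_scaled)
print("corr:", corr, "-> scaled:", corr_scaled)
```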

7. What do you mean by p-value?

•  The p-value is a probability used to determine statistical significance in a hypothesis test.
•  Hypothesis tests are used to check the validity of the null hypothesis (the claim).
•  P-values can be calculated using p-value tables or statistical software.
•  The p-value lies between 0 and 1. At a common significance level of 0.05, there are two cases:
•  (p-value < 0.05): A small p-value indicates strong evidence against the null hypothesis, so we reject the null hypothesis.
•  (p-value > 0.05): A large p-value indicates weak evidence against the null hypothesis, so we fail to reject the null hypothesis.
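As a sketch, a one-sample t-test with SciPy returns a p-value directly. The data here are simulated with a true mean of 53, so the null hypothesis of mean 50 should be rejected:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Simulated sample whose true mean (53) differs from the null value (50)
sample = rng.normal(loc=53.0, scale=5.0, size=100)

# Null hypothesis: the population mean is 50
t_stat, p_value = stats.ttest_1samp(sample, popmean=50.0)
print("t-statistic:", t_stat)
print("p-value:", p_value)
if p_value < 0.05:
    print("reject the null hypothesis")
```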

8. Which is the best suitable language among Python and R for text analytics?

Both R and Python are suitable languages for text analytics, but Python is often preferred because:

•  Python's pandas library provides easy-to-use data structures and data analysis tools.
•  Python generally offers fast execution and a rich ecosystem of text-processing libraries (e.g., NLTK, spaCy).
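As a minimal sketch of pandas for text work, its `.str` accessor applies vectorized string operations across a whole column at once (the review strings are made up for illustration):

```python
import pandas as pd

# Hypothetical customer reviews
reviews = pd.Series([
    "Great product, fast delivery!",
    "terrible quality",
    "Great value for money",
])

# Vectorized string operations via the .str accessor
word_counts = reviews.str.split().str.len()
contains_great = reviews.str.lower().str.contains("great")
print(word_counts.tolist())     # [4, 2, 4]
print(contains_great.tolist())  # [True, False, True]
```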

9. What do you understand by confusion matrix?

•  A confusion matrix is a core concept of the statistical classification problem.
•  A confusion matrix is a table used to describe and measure the performance of a binary classification model in machine learning.
•  The matrix itself is easy to understand, but the terminology used in it can be confusing. It is also known as an error matrix.
•  It is used in statistics, data mining, machine learning, and various Artificial Intelligence applications.
•  It is a two-dimensional table, "actual" vs. "predicted", with the same set of classes along both dimensions.
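The labels below are hypothetical, but they show how scikit-learn builds the matrix and how its four cells map to true/false positives and negatives:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels from a binary classifier (1 = positive, 0 = negative)
y_actual    = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_predicted = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

cm = confusion_matrix(y_actual, y_predicted)
# For binary labels, ravel() yields the four cells in this order
tn, fp, fn, tp = cm.ravel()
print(cm)
print("TP:", tp, "TN:", tn, "FP:", fp, "FN:", fn)
```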

10. What is the ROC curve?

ROC curve stands for Receiver Operating Characteristic curve, which graphically represents the performance of a binary classification model at all classification thresholds. The curve plots the true positive rate (TPR) against the false positive rate (FPR) at different threshold values.
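The labels and predicted probabilities below are hypothetical; `roc_curve` sweeps the threshold over the scores to produce the (FPR, TPR) points, and the area under the curve (AUC) summarizes them in one number:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Hypothetical true labels and predicted probabilities from a binary classifier
y_true  = np.array([0, 0, 1, 1, 0, 1, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.9, 0.3])

fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)
print("FPR:", fpr)
print("TPR:", tpr)
print("AUC:", auc)
```

An AUC of 0.5 corresponds to random guessing, while 1.0 is a perfect classifier.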