by Vinod M

Last modified: September 4th 2021

If you are searching for Data Science interview questions, you have landed at the right place. You have probably heard the saying, "Data Science has been called the Sexiest Job of the 21st century". Because data has grown in importance, the demand for data scientists has been growing over the years. **According to IBM’s predictions, demand for the data scientist role will soar 28% by 2021**.

This blog is specifically designed to provide you with the frequently asked **Data science interview questions**. We have grouped the interview questions by topic and complexity. The questions presented here were collected based on the opinions of Data Science experts.

Types of Data Science Interview Questions

We are going to cover the essential questions that are asked frequently in each segment. Without wasting much time, let's discuss each topic in detail.

- **What is Data Science?**
- **What is power analysis?**
- **What is Root cause Analysis?**
- **What makes the difference between "Long" and "Wide" Format data?**
- **What is meant by linear regression?**
- **What is Cross-validation?**
- **List different types of classification algorithms?**
- **While conducting analysis, how can you treat missing values?**
- **What is meant by Deep learning?**
- **Explain the difference between machine learning and deep learning?**

Data science is a multidisciplinary subject used to extract meaningful insights from different types of data by employing scientific methods, processes, and algorithms. Data science helps in solving analytically complex problems in a simplified way. It acts as a stream where you can utilize raw data to generate business value.

Supervised Learning: Supervised learning is a process of training machines with labeled data, that is, examples paired with the right answer. In supervised learning, the machine uses the labeled data as a basis for predicting the answer for the next input.

Unsupervised learning: It is another form of training machines, using information that is unlabeled or unstructured. Unlike supervised learning, there is no teacher or predefined answer for the machine to learn from, so it has to find structure in the data on its own.


A/B testing is statistical hypothesis testing for a randomized experiment with two variants, A and B. It is used to identify which change to a web page maximizes an outcome of interest. A/B testing helps organizations find the variant that works best for a particular business.
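As a rough illustration of how the result of an A/B test might be evaluated, the sketch below runs a two-proportion z-test on conversion counts. The function name and the visitor and conversion numbers are hypothetical, and the z-test is one common choice rather than the only way to analyze an A/B test.

```python
import math
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for the difference between two conversion rates."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Pooled rate under the null hypothesis that A and B convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_error = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_error
    # Probability of a difference at least this large if the null is true.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical data: 200 of 5000 visitors converted on page A, 260 of 5000 on page B.
z, p_value = two_proportion_z_test(200, 5000, 260, 5000)
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```

A small p-value (for example, below 0.05) suggests that the difference between the two variants is unlikely to be due to chance alone.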

Selection bias is a type of error that arises when the researcher decides who is going to be studied. It happens when the selection of participants is not random. Selection bias is also sometimes referred to as the selection effect. It distorts the statistical analysis, and if the selection bias is not taken into account, the conclusions of the study may be wrong.

The classification technique is widely used in mining data sets.

Power analysis is an experimental design technique for determining the sample size needed to detect an effect of a given size.

Root cause analysis was initially used to analyze industrial accidents, but it has since been extended to many other areas. It is a technique used to identify the root cause of a particular problem.

Collaborative filtering is a process used by recommender systems to find patterns and information by combining numerous data sources, several agents, and collaborating perspectives. In other words, collaborative filtering is a method of making automatic predictions about a user's interests from the preferences of many users.

- **Sampling Bias:** This bias arises when samples are selected non-randomly, so that some members of the population are less likely to be included than others. In general terms, most of the people selected end up belonging to one group.
- **Time Interval:** A trial may be terminated early at an extreme value (often for ethical reasons), but the extreme value is most likely to be reached by the variable with the largest variance, even if all the variables have a similar mean.
- **Data:** Data bias occurs when specific subsets of data are chosen to support a conclusion, or bad data is rejected on arbitrary grounds, instead of according to previously stated or generally agreed criteria.
- **Attrition Bias:** Attrition bias is an error that occurs due to the unequal loss of participants from a randomized controlled trial (RCT). The loss of participants, for whatever reason, is called attrition.

The central limit theorem is a statistical theory stating that, given a sufficiently large sample size drawn from a population with a finite level of variance, the mean of the samples is approximately equal to the mean of the total population. In addition, the distribution of the sample means is approximately normal, whatever the shape of the population's distribution.
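The central limit theorem can be checked with a small simulation. The sketch below is only an illustration: it assumes a made-up, strongly skewed population (exponentially distributed values) and hypothetical sample sizes, then compares the mean of many sample means with the population mean.

```python
import random
import statistics


def sample_means(population, sample_size, n_samples, seed=0):
    """Draw n_samples random samples and return the mean of each one."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.choices(population, k=sample_size))
        for _ in range(n_samples)
    ]


# Hypothetical population: 50,000 values from a skewed (exponential) distribution.
population_rng = random.Random(42)
population = [population_rng.expovariate(1.0) for _ in range(50_000)]

means = sample_means(population, sample_size=50, n_samples=2_000)

population_mean = statistics.fmean(population)
mean_of_sample_means = statistics.fmean(means)

# The spread of the sample means should be close to sigma / sqrt(sample_size).
expected_spread = statistics.pstdev(population) / 50 ** 0.5
observed_spread = statistics.stdev(means)

print(population_mean, mean_of_sample_means)
print(expected_spread, observed_spread)
```

Even though the population itself is far from bell-shaped, the sample means cluster tightly and symmetrically around the population mean.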


Data sampling is a statistical analysis method in which a representative portion of the data is selected and analyzed to identify the hidden trends and patterns in a larger data set. It helps data scientists work with a limited portion of the data to produce accurate results, rather than working on the entire data set.

Types of sampling methods are:

- Simple random sampling method
- Stratified sampling
- Cluster sampling
- Multistage sampling
- Systematic sampling

In the wide format, each subject's repeated responses are recorded in a single row, and each response is in a separate column. In the long format, each row is one time point per subject. In the wide format, the columns generally represent groups, whereas in the long format, the rows represent groups.
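To make the difference concrete, here is a minimal sketch that reshapes a small table between the two formats using plain Python dictionaries. The column names (`subject`, `week1`, `week2`) and the helper functions are hypothetical; in practice a library such as pandas offers equivalent reshaping operations.

```python
def wide_to_long(rows, id_col, value_cols, var_name="time", value_name="value"):
    """Turn one row per subject into one row per subject and time point."""
    long_rows = []
    for row in rows:
        for col in value_cols:
            long_rows.append(
                {id_col: row[id_col], var_name: col, value_name: row[col]}
            )
    return long_rows


def long_to_wide(rows, id_col, var_name="time", value_name="value"):
    """Turn one row per subject and time point back into one row per subject."""
    wide = {}
    for row in rows:
        subject_row = wide.setdefault(row[id_col], {id_col: row[id_col]})
        subject_row[row[var_name]] = row[value_name]
    return list(wide.values())


# Hypothetical wide-format data: one row per subject, one column per response.
wide_data = [
    {"subject": "s1", "week1": 5, "week2": 7},
    {"subject": "s2", "week1": 6, "week2": 8},
]

long_data = wide_to_long(wide_data, "subject", ["week1", "week2"])
for row in long_data:
    print(row)
```

Here two wide rows become four long rows, one for each subject and time point, and `long_to_wide` reverses the transformation.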

Data can be distributed in different ways, and sometimes it is skewed to the left or to the right.

However, data can also be distributed around a central value without any bias to the left or right. In that case it follows the normal distribution, which takes the form of a bell-shaped curve.

The random variables are distributed in the form of a symmetrical bell-shaped curve.

Below mentioned are the properties of Normal distribution.

- Unimodal: it has one mode.
- Symmetrical: the left and right halves are mirror images of each other.
- It is bell-shaped, and the maximum height is at the mean.
- In a normal distribution, the mean, median, and mode are all located at the center.
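The properties listed above can be verified numerically with Python's standard library. This is a minimal sketch; the mean of 100 and standard deviation of 15 are arbitrary values chosen for illustration.

```python
import math
from statistics import NormalDist

# Hypothetical normal distribution with mean 100 and standard deviation 15.
dist = NormalDist(mu=100, sigma=15)

# Symmetry: the density is the same at equal distances on both sides of the mean.
is_symmetric = math.isclose(dist.pdf(85), dist.pdf(115))

# Maximum height at the mean: the density is lower on either side of it.
peak_at_mean = dist.pdf(100) > dist.pdf(99) and dist.pdf(100) > dist.pdf(101)

# Mean, median, and mode coincide at the center.
same_center = dist.mean == dist.median == dist.mode

# Share of values within one standard deviation of the mean (about 68%).
share_within_one_sd = dist.cdf(115) - dist.cdf(85)

print(is_symmetric, peak_at_mean, same_center, round(share_within_one_sd, 4))
```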

A Type I error occurs when the null hypothesis is true but is rejected. A Type II error occurs when the null hypothesis is false but erroneously fails to be rejected.

Linear regression is commonly used for predictive analysis. It examines how an outcome depends on one or more factors. Let's say the price of a house depends on two factors, location and size. To find the relationship between these factors and the price, we conduct a linear regression. Linear regression tells us whether each factor has a positive or negative effect on the outcome.

The word sensitivity is often used when validating the accuracy of a classifier (SVM, Random Forest, Logistic Regression, etc.).

In statistical analysis, sensitivity is the ratio of correctly predicted true events to the total number of true events. The true events are the events that are actually true in nature; a true positive is such an event that the model also predicted as true.

The calculation of sensitivity is pretty straightforward.

Sensitivity = ( True Positives ) / ( Positives in Actual Dependent Variable )
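As a small worked example of the sensitivity formula, the sketch below counts true positives and actual positives from two lists of labels. The labels are made up for illustration, and the function name is hypothetical.

```python
def sensitivity(y_true, y_pred, positive=1):
    """True positives divided by the number of actual positives."""
    true_positives = sum(
        1 for actual, predicted in zip(y_true, y_pred)
        if actual == positive and predicted == positive
    )
    actual_positives = sum(1 for actual in y_true if actual == positive)
    if actual_positives == 0:
        raise ValueError("sensitivity is undefined without actual positives")
    return true_positives / actual_positives


# Hypothetical labels: 5 actual positives, of which the model found 3.
y_true = [1, 1, 1, 1, 0, 0, 0, 1]
y_pred = [1, 0, 1, 1, 0, 1, 0, 0]

score = sensitivity(y_true, y_pred)
print(score)
```

In this example the model recovers 3 of the 5 actual positives, so the sensitivity is 3 / 5.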

In machine learning as well as in statistics, a common task is to fit a model to a set of training data. The goal is to make reliable predictions on general data that the model has not been trained on.

In overfitting, a statistical model describes the random noise or errors instead of the underlying relationship. Overfitting occurs when the model is too complex, which means it has too many parameters relative to the number of observations. An overfitted model has poor predictive performance because it overreacts to minor fluctuations in the training data.

Underfitting happens when a machine learning algorithm or statistical model is unable to capture the underlying trend of the data. An example is trying to fit a linear model to nonlinear data. This kind of model would also have poor predictive performance.

We actually prefer Python for the following reasons.

- R suits machine learning better than text analytics.
- Python is fast for all types of text analytics.
- Python has the pandas library, which provides high-performance data analysis tools and easy-to-use data structures.

Univariate, bivariate, and multivariate analysis are descriptive analysis techniques that are differentiated by the number of variables involved at a given point in time. For instance, an analysis of the sales of a particular territory involves only one variable, so it is treated as a univariate analysis.

Bivariate analysis is used to understand the relationship between two variables at a given time, for example on a scatter plot. A good example of bivariate analysis is studying the relationship between the sales and the expenses of a particular product.

Multivariate analysis deals with more than two variables to understand the effect of the variables on the responses.

Cluster sampling is used when you want to study a population that is spread across a wide area and simple random sampling is difficult to apply. A cluster sample is a probability sample in which each sampling unit is a collection, or cluster, of elements.

For instance: a researcher in India wants to study the academic performance of students. To do that, he divides the entire country into cities (clusters). He can then select cities using a method that fits his research requirements, such as systematic or simple random sampling.
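The example above could be sketched as follows. This is a simplified one-stage cluster sample, in which whole clusters are picked by simple random sampling and every element of a picked cluster is kept; the city names and student IDs are invented for illustration.

```python
import random


def cluster_sample(clusters, n_clusters, seed=None):
    """Randomly pick whole clusters and return every element inside them."""
    if not 0 < n_clusters <= len(clusters):
        raise ValueError("n_clusters must be between 1 and the number of clusters")
    rng = random.Random(seed)
    # Simple random sampling is applied to the clusters, not to individuals.
    chosen = rng.sample(sorted(clusters), n_clusters)
    sample = [element for name in chosen for element in clusters[name]]
    return chosen, sample


# Hypothetical clusters: cities mapped to the students who live there.
students_by_city = {
    "Chennai": ["c1", "c2", "c3"],
    "Delhi": ["d1", "d2"],
    "Mumbai": ["m1", "m2", "m3", "m4"],
    "Pune": ["p1", "p2"],
}

chosen_cities, sampled_students = cluster_sample(students_by_city, 2, seed=7)
print(chosen_cities, sampled_students)
```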

Systematic sampling is a technique that, as the name suggests, selects samples from an ordered sampling frame in a systematic way. In systematic sampling, the list is treated as circular: once the selection reaches the end of the list, it continues again from the start. The equal-probability method, in which every k-th element is selected, is the best-known example of systematic sampling.
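A minimal sketch of equal-probability systematic sampling is shown below. It assumes a random starting point and a fixed step k, and wraps around the end of the list to reflect the circular treatment described above; the function name and the frame of 100 numbered units are hypothetical.

```python
import random


def systematic_sample(frame, n, seed=None):
    """Select n elements from an ordered frame by taking every k-th element."""
    if not 0 < n <= len(frame):
        raise ValueError("n must be between 1 and the size of the frame")
    rng = random.Random(seed)
    step = len(frame) // n                # sampling interval k
    start = rng.randrange(len(frame))     # random starting position
    # The modulo makes the list circular: past the end, continue from the start.
    return [frame[(start + i * step) % len(frame)] for i in range(n)]


# Hypothetical ordered sampling frame of 100 numbered units.
frame = list(range(100))
sample = systematic_sample(frame, 10, seed=1)
print(sample)
```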

A validation set is used to select the model's parameters and helps avoid overfitting of the model that is being built.

A test set, on the other hand, is used to test or evaluate the performance of the trained machine learning model.

Cross-validation is a model validation technique used to evaluate how the results of a statistical analysis would generalize to an independent dataset. It is mainly used in settings where the objective is forecasting and one wants to estimate how accurately the model would perform in practice.

The main goal of cross-validation is to test a model while it is still in the training phase, to limit problems like overfitting, and to get insight into how the model will generalize to an independent data set.
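One common form of cross-validation is k-fold cross-validation, in which the data is split into k parts and each part is used once for validation while the rest is used for training. The sketch below only generates the index splits; the function name is hypothetical, and in a real project the data would usually be shuffled first and a model fitted on each training split.

```python
def k_fold_indices(n_samples, k):
    """Yield (training_indices, validation_indices) for each of the k folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        # The first `extra` folds get one additional sample each.
        size = base + (1 if fold < extra else 0)
        validation = indices[start:start + size]
        training = indices[:start] + indices[start + size:]
        yield training, validation
        start += size


folds = list(k_fold_indices(10, 3))
for training, validation in folds:
    print("train:", training, "validate:", validation)
```

Every sample is used for validation exactly once, so the average validation score over the folds gives an estimate of how the model behaves on data it has not seen.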

Machine learning is a subfield of artificial intelligence that enables systems to learn by themselves from data and previous experience. Machine learning develops programs that are capable of accessing data and learning from it.

Yes! Time series analysis can be used in machine learning based on the application requirements.

Below mentioned are the different types of classification algorithms.

- Logistic regression
- Support Vector Machines
- Nearest Neighbor
- Boosted trees
- Decision trees
- Neural Networks
- Random Forest

There are three different types of machine learning, which are:

- Reinforcement learning
- Supervised learning
- Unsupervised learning

Recommender systems are a kind of information filtering system that predicts the preference or rating a user would give to a product. These recommender systems are widely used in areas like news, movies, social tags, music, products, etc.

We can see movie recommenders on Netflix, IMDb, and BookMyShow, product recommenders on e-commerce sites like eBay, Amazon, and Flipkart, as well as YouTube video recommendations and game recommendations.

Linear regression is a statistical technique in which the score of one variable, Y, is predicted from the score of a second variable, X. When plotted, Y is presented on the Y-axis and X on the X-axis. In linear regression, Y is referred to as the criterion variable and X as the predictor variable.
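For a single predictor, the regression line can be fitted with ordinary least squares. The sketch below is a minimal illustration with made-up house sizes and prices; the function name and the data are hypothetical.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of X and Y divided by the variance of X.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: house size (predictor X) and price (criterion Y).
sizes = [50, 70, 90, 110]
prices = [150, 190, 230, 270]

slope, intercept = fit_line(sizes, prices)
predicted_price = slope * 100 + intercept  # prediction for a size of 100
print(slope, intercept, predicted_price)
```

A positive slope means that Y tends to increase with X, and a negative slope means that it tends to decrease.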

We can identify outlier values by using a graphical analysis method or the univariate method. When there are only a few outlier values, they can be assessed individually, but when there are many, they are usually substituted with either the 1st or the 99th percentile value.

Below are the common ways to treat outlier values.

- Change the value to bring it within a range
- Remove the value
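The percentile substitution described above can be sketched with the standard library. The example assumes made-up readings with one extreme value, and the choice of the "inclusive" quantile method is an assumption for illustration; other percentile definitions give slightly different cut-offs.

```python
import statistics


def cap_outliers(values, lower_pct=1, upper_pct=99):
    """Replace values outside the given percentiles with the percentile values."""
    # quantiles(..., n=100) returns the 99 cut points between percentiles.
    cut_points = statistics.quantiles(values, n=100, method="inclusive")
    low = cut_points[lower_pct - 1]
    high = cut_points[upper_pct - 1]
    capped = [min(max(value, low), high) for value in values]
    return capped, low, high


# Hypothetical readings: 1 to 100 plus one extreme outlier.
readings = list(range(1, 101)) + [10_000]
capped, low, high = cap_outliers(readings)
print(low, high, capped[-1])
```

The extreme reading is pulled down to the 99th percentile value, while values inside the range are left unchanged.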

Following are the steps involved in any analytics project

- Identify and understand the business problem
- Examine the data and understand it well
- Prepare the data for modeling by using various methods
- Once the data is prepared, start running the model and analyze the results.
- Validate the model with the help of a new dataset
- Implement the model and analyze the outcomes over a period of time.

To identify the extent of the missing values, you first have to identify the variables with missing values. If the analyst finds any patterns in the missing values, he should concentrate on them, because that is where meaningful business insights may be found.

If the analyst is unable to find any patterns, he can substitute the missing values with the median or mean values, or simply ignore them. Another option is to assign a default value, which can be the minimum, maximum, or mean value.

For instance, if 80% of the values for a variable are missing, you can tell the interviewer that you would drop the variable instead of treating the missing values.
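The treatment described above might look like the following sketch, which drops variables that are mostly missing and fills the remaining gaps with the median or mean. The column names, the data, and the helper function are hypothetical, and missing values are represented by `None`.

```python
import statistics


def treat_missing(columns, drop_threshold=0.8, strategy="median"):
    """Drop variables that are mostly missing; fill the rest with median or mean."""
    treated = {}
    for name, values in columns.items():
        present = [v for v in values if v is not None]
        missing_share = (len(values) - len(present)) / len(values)
        if not present or missing_share >= drop_threshold:
            continue  # too little information: drop the whole variable
        if strategy == "median":
            fill = statistics.median(present)
        else:
            fill = statistics.fmean(present)
        treated[name] = [fill if v is None else v for v in values]
    return treated


# Hypothetical dataset: "income" is missing for 5 of 6 records.
data = {
    "age": [20, None, 40, 30, None, 50],
    "income": [None, None, None, None, None, 50_000],
}

cleaned = treat_missing(data)
print(cleaned)
```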

Deep learning is a function of artificial intelligence that gives machines the capability to mimic the workings of the human brain, such as processing data and analyzing it to make decisions. Deep learning is a subfield of machine learning and is capable of learning from data that is unlabeled or unstructured.

Deep learning has existed for some time, but it has become popular in recent years for the following two reasons.

- Data is the primary fuel for deep learning to function effectively, and the generation of data from various sources has increased massively over the years.
- Hardware resources have developed to the point where they can process these data models efficiently.

Artificial neural networks are the main elements that have made machine learning popular. These neural networks are developed based on the functionality of the human brain. Artificial neural networks are trained to learn from examples and experiences without being programmed explicitly. Artificial neural networks work based on nodes called artificial neurons that are connected to one another. Each connection acts similar to synapses in the human brain that helps in transmitting the signals between the artificial neurons.

Artificial neural networks work similarly to the biological human brain. A neural network works like the human nervous system, which receives, processes, and transmits information. A neural network generally consists of three layers, which are:

- Input Layer
- Hidden Layer
- Output Layer

In each neuron, the inputs are combined into a weighted sum, a bias is added, and the result is passed through an activation function.
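The computation inside a single layer can be sketched in a few lines. The example below is an illustration only: the weights, biases, and inputs are made-up numbers, and the sigmoid is just one possible activation function.

```python
import math


def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron_output(inputs, weights, bias):
    """Weighted sum of the inputs plus a bias, passed through the activation."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(z)


def layer_output(inputs, weight_rows, biases):
    """Apply every neuron of a layer to the same inputs."""
    return [neuron_output(inputs, w, b) for w, b in zip(weight_rows, biases)]


# Hypothetical hidden layer with two neurons and two inputs.
hidden = layer_output(
    inputs=[1.0, 2.0],
    weight_rows=[[0.5, -0.25], [1.0, 1.0]],
    biases=[0.0, -3.0],
)
print(hidden)
```

The outputs of one layer become the inputs of the next, which is how values flow from the input layer through the hidden layer to the output layer.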

Backpropagation is an algorithm used in deep learning to train multilayer neural networks. With this method, the error is moved from the end of the network back to all the weights inside it, which allows efficient computation of the gradient.

It consists of the below-mentioned steps:

- Forward propagation of the training data through the network.
- Derivatives are computed with the help of the output and the target.
- Backpropagation for computing the derivative of the error.
- Use the previously calculated derivatives for the output.
- Update the weights.

Below mentioned are the three different variants of backpropagation, which differ in how much data is used to compute each gradient update.

- Stochastic Gradient Descent: In this variant, a single training example is used to calculate the gradient and update the parameters.
- Batch Gradient Descent: In this variant, the whole dataset is used to calculate the gradient, and one update is executed at each iteration.
- Mini-batch Gradient Descent: It is considered a popular optimization algorithm in deep learning. In this variant, a mini-batch of samples is used instead of a single training example.
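The three variants differ only in how many examples feed each update, which the sketch below captures with a single `batch_size` parameter. It fits a simple line rather than a neural network, and the data, learning rate, and number of epochs are assumptions chosen for illustration.

```python
import random


def gradient_descent(xs, ys, batch_size, lr=0.05, epochs=1000, seed=0):
    """Fit y = w * x + b with squared error, updating once per batch."""
    rng = random.Random(seed)
    data = list(zip(xs, ys))
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(data)
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # Average gradient of the squared error over the current batch.
            grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / len(batch)
            grad_b = sum(2 * (w * x + b - y) for x, y in batch) / len(batch)
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b


# Hypothetical noise-free data generated from y = 2x + 1.
xs = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5]
ys = [2 * x + 1 for x in xs]

sgd = gradient_descent(xs, ys, batch_size=1)          # stochastic
mini = gradient_descent(xs, ys, batch_size=4)         # mini-batch
full = gradient_descent(xs, ys, batch_size=len(xs))   # batch
print(sgd, mini, full)
```

With a batch size of 1 the parameters are updated after every example, with the full dataset they are updated once per pass, and a mini-batch sits in between.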

Below mentioned are the various types of Deep Learning Frameworks.

- PyTorch
- Microsoft Cognitive Toolkit
- Caffe
- TensorFlow
- Keras
- Chainer

The activation function introduces nonlinearity into the neural network, which enables the network to learn complex functions. Without it, the network would only be able to learn a linear function, which is not enough to analyze complex data. An activation function is the function in an artificial neuron that delivers the output based on the input given.
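The role of the nonlinearity can be demonstrated with a tiny example. In the sketch below, two stacked linear layers with no activation collapse into a single linear layer, while inserting a ReLU between them produces a function that is no longer a straight line. The weights are arbitrary values chosen for illustration.

```python
import math


def linear_layer(x, w, b):
    """A one-input, one-output linear layer."""
    return w * x + b


def relu(z):
    """Rectified linear unit: passes positive values, blocks negative ones."""
    return max(0.0, z)


def two_linear_layers(x):
    # Two layers without an activation function in between.
    return linear_layer(linear_layer(x, 2.0, -1.0), -3.0, 0.5)


def collapsed_single_layer(x):
    # Equivalent single layer: w = -3 * 2 = -6, b = -3 * (-1) + 0.5 = 3.5.
    return linear_layer(x, -6.0, 3.5)


def two_layers_with_relu(x):
    # The same two layers with a ReLU activation in between.
    return linear_layer(relu(linear_layer(x, 2.0, -1.0)), -3.0, 0.5)


points = [-2.0, -1.0, 0.0, 0.5, 1.0, 3.0]
stays_linear = all(
    math.isclose(two_linear_layers(x), collapsed_single_layer(x)) for x in points
)

# A straight line satisfies f(midpoint) == average of the endpoint values.
midpoint_value = two_layers_with_relu(1.0)
endpoint_average = (two_layers_with_relu(-1.0) + two_layers_with_relu(3.0)) / 2
is_nonlinear = not math.isclose(midpoint_value, endpoint_average)
print(stays_linear, is_nonlinear)
```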

Autoencoders are learning networks that transform inputs into outputs with no error or minimal error, which means the output must be very close to the input. A few layers are added between the input and the output, and the sizes of these layers are smaller than the input layer. The autoencoder receives unlabeled input, which is encoded and then used to reconstruct the input.

Boltzmann machines have a simple learning algorithm that allows them to discover the important features that represent complex regularities in the data. These machines are generally used to optimize the weights and the quantity for a given problem. The learning algorithm is very slow in networks with many layers of feature detectors. Restricted Boltzmann Machines have a single layer of feature detectors, which makes them faster than the rest.

Random forest is a versatile machine learning method that performs both classification and regression tasks. It also helps with dimensionality reduction and with treating missing values and outlier values. It is a kind of ensemble learning, in which various weak models are combined to form a robust model.

Machine Learning: It is treated as a field of computer science that enables computers to learn from experiences automatically. Below are the three different categories of machine learning.

- Supervised Machine Learning
- Unsupervised Machine Learning
- Reinforcement Machine Learning

Deep Learning: It is a subfield of machine learning that works with algorithms that mimic the functionality of the human brain.

Reinforcement learning is learning what to do, that is, how to map situations to actions. The end goal is to maximize a numerical reward signal. The learner is not told which action to take next but instead must discover which actions yield the maximum reward. Reinforcement learning is inspired by the learning process of human beings. It works based on a reward/penalty mechanism.

The process of removing the sub-nodes of a decision node is called pruning. It is the opposite process of splitting.
