Top 50 Data Science Interview Questions and Answers

Harvard Business Review has referred to data scientist as the best job of the 21st century. Glassdoor has ranked it among the best jobs in America. According to IBM, demand for this role was expected to grow by 28% by 2020.

It should be no surprise to anyone that, in the new era of Big Data and machine learning, data scientists are the rockstars. Companies that can use massive amounts of data to improve the way they serve customers, build their products and run their operations will be well positioned in this economy.

  1. What are the differences between supervised and unsupervised learning?

Supervised learning:

It uses known, labelled data as input.

It has a feedback mechanism.

The most commonly used algorithms are decision trees, logistic regression and support vector machines.

Unsupervised learning:

It uses unlabelled data as input.

It has no feedback mechanism.

The most commonly used algorithms are k-means clustering and the Apriori algorithm.

  1. How is logistic regression done?

Logistic regression measures the relationship between the dependent variable and one or more independent variables by estimating probabilities using its underlying logistic (sigmoid) function.
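
As a minimal sketch of that idea, the logistic function maps a weighted sum of the independent variables to a probability between 0 and 1. The function names, weights and intercept below are hypothetical values chosen for illustration, not fitted parameters.

```python
from math import exp


def logistic(z):
    """Logistic (sigmoid) function: maps any real number into the range (0, 1)."""
    return 1.0 / (1.0 + exp(-z))


def predict_probability(features, weights, intercept):
    """Probability of the positive class for one observation."""
    z = intercept + sum(w * x for w, x in zip(weights, features))
    return logistic(z)


# Hypothetical single-feature model: intercept -3.0, weight 1.5.
p_low = predict_probability([2.0], weights=[1.5], intercept=-3.0)
p_high = predict_probability([10.0], weights=[1.5], intercept=-3.0)
```

In a real model the weights and intercept are estimated from the training data rather than set by hand.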

  1. Explain the steps in making a decision tree 
  • Take the entire data set as input
  • Calculate the entropy of the target variable as well as of the predictor attributes
  • Calculate the information gain of each attribute
  • Choose the attribute with the highest information gain as the root node
  • Repeat the same procedure on each branch until every branch ends in a leaf
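
The entropy and information gain steps above can be sketched in plain Python. The toy rows, labels and attribute names are hypothetical and only illustrate how the attribute for the first split would be chosen.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, labels, attribute):
    """Entropy of the target minus the weighted entropy after splitting on `attribute`."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical toy data: each row is a dict of predictor attributes.
rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"},
]
labels = ["play", "play", "stay", "stay"]

gains = {attr: information_gain(rows, labels, attr) for attr in rows[0]}
best_attribute = max(gains, key=gains.get)
```

Here "outlook" separates the two classes perfectly while "windy" tells us nothing, so "outlook" has the highest information gain and would be chosen for the first split.
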
  1. How do you build a random forest model?

A random forest is built from a number of decision trees. If you split the data into different samples and build a decision tree on each sample, the random forest brings all those trees together and combines their predictions, by majority vote for classification or by averaging for regression.

  1. How can you avoid overfitting your model?

There are three main methods to avoid overfitting:

Keep the model simple by taking fewer variables into account.

Use cross-validation techniques, such as k-fold cross-validation.

Use regularization techniques, such as LASSO, that penalize certain model parameters.
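
To make the cross-validation idea concrete, here is a minimal sketch of how k-fold splitting divides the sample indices, so that every observation is used for validation exactly once. The helper name `k_fold_indices` is hypothetical, and the sketch does not shuffle the data, which is usually done first in practice.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


folds = list(k_fold_indices(n_samples=10, k=3))
```

The model is trained on each `train` part and scored on the matching `validation` part; averaging the k scores gives a more honest estimate of how the model performs on unseen data.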

  1. Differentiate between univariate, bivariate and multivariate analysis

Univariate:

This data contains only one variable. The purpose of the analysis is to describe the data and find patterns that exist within it.

Bivariate:

This data involves two different variables. The analysis deals with causes and relationships, and is done to determine the relationship between the two variables.

Multivariate:

This data involves three or more variables. It is similar to bivariate analysis, except that there can be more than one dependent variable.

  1. What are the feature selection methods used to select the right variables?

There are two main methods:

Filter methods: these include linear discriminant analysis, ANOVA and the chi-square test.

Wrapper methods: these include forward selection, backward selection and recursive feature elimination.

  1. You are given a data set consisting of variables with more than 30 per cent missing values. How will you deal with them?

If the data set is large, the rows that contain missing values can simply be removed, as this is the quickest way.

For smaller data sets, the missing values can be substituted with the mean of the values that are present, for example by using a pandas DataFrame in Python.
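
The substitution step can be sketched with the standard library alone. The column values below are hypothetical, and missing entries are represented as `None`.

```python
from statistics import mean


def impute_with_mean(values):
    """Replace None entries with the mean of the observed (non-missing) values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


# Hypothetical column with two missing values.
ages = [20, None, 30, 40, None]
filled = impute_with_mean(ages)
```

With pandas, the same idea is commonly written as `df["age"].fillna(df["age"].mean())`, assuming a DataFrame `df` with a numeric column named `age`.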

  1. For the given points, how will you calculate the Euclidean distance in Python?

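plot1 = [1, 4]

plot2 = [2, 6]

The Euclidean distance can be calculated as follows:

from math import sqrt

euclidean_distance = sqrt((plot1[0] - plot2[0])**2 + (plot1[1] - plot2[1])**2)

For these points the result is sqrt(1 + 4) = sqrt(5), which is about 2.24.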

  1. What is dimensionality reduction, and what are its benefits?

Dimensionality reduction is the process of converting a data set with a large number of dimensions into one with fewer dimensions that conveys similar information concisely.

This reduction compresses the data and so reduces the storage space needed. It also reduces computation time.

  1. How should you maintain a deployed model?

The steps to maintain a deployed model are:

Monitor: Constant monitoring is needed to track the model's performance and accuracy.

Evaluate: Evaluation metrics of the current model are calculated to determine whether a new algorithm is needed.

Compare: New models are compared with each other to determine which performs best.

Rebuild: The best-performing model is rebuilt on the current state of the data.

  1. What is a recommender system?

A recommender system predicts how a user would rate a specific product based on their preferences.

  1. How do you find RMSE and MSE in a linear regression model?

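These are two of the most common measures of accuracy for a linear regression model. MSE stands for Mean Squared Error: the average of the squared differences between the predicted and actual values. RMSE stands for Root Mean Squared Error: the square root of the MSE, which puts the error back in the same units as the target variable.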
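
A minimal sketch of both calculations in plain Python follows. The actual and predicted values are hypothetical; in practice the predictions would come from the fitted regression model.

```python
from math import sqrt


def mse(actual, predicted):
    """Mean squared error: average of the squared prediction errors."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error: square root of the MSE."""
    return sqrt(mse(actual, predicted))


# Hypothetical target values and model predictions.
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 5.0, 8.0, 11.0]

model_mse = mse(actual, predicted)    # squared errors 1, 0, 1, 4 average to 1.5
model_rmse = rmse(actual, predicted)  # square root of 1.5
```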

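  1. How can you select k for k-means?

The elbow method is used to select k for k-means clustering. The idea is to run k-means clustering on the data set for a range of values of k, where k is the number of clusters, and to record the within-cluster sum of squares (WSS) for each. Plotting WSS against k shows a bend, or elbow, after which adding more clusters brings little improvement; that point is taken as k.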
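
The elbow method can be illustrated with a small one-dimensional k-means written in plain Python. This is a simplified sketch: the data points are hypothetical, and the deterministic starting centroids are an assumption made to keep the example repeatable, whereas real implementations usually use random or k-means++ initialization.

```python
def kmeans_1d(points, k, iterations=20):
    """Lloyd's algorithm on 1-D data; returns (centroids, within-cluster sum of squares)."""
    ordered = sorted(points)
    # Deterministic start: spread the initial centroids across the sorted data.
    centroids = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: abs(p - centroids[c]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty).
        centroids = [sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)]
    wss = sum(min((p - c) ** 2 for c in centroids) for p in points)
    return centroids, wss


# Hypothetical data with three obvious groups, around 1, 5 and 9.
points = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8, 9.0, 9.2, 8.8]
wss_by_k = {k: kmeans_1d(points, k)[1] for k in range(1, 6)}
```

For this data the WSS falls sharply up to k = 3 and barely changes afterwards, so the elbow suggests three clusters.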

  1. What is the significance of p-value?

p-value <= 0.05

This indicates strong evidence against the null hypothesis, so the null hypothesis is rejected.

p-value > 0.05

This indicates weak evidence against the null hypothesis, so the null hypothesis is not rejected.

p-value at the 0.05 cutoff

This is considered marginal, meaning it could go either way.

  1. How can outlier values be treated?

An outlier can be dropped only if it is a garbage value.

  1. How can time-series data be declared stationary?

A time series is stationary when its mean and variance remain constant over time.

  1. "People who bought this also bought…" recommendations seen on Amazon are the result of which algorithm?

The recommendation engine is built on collaborative filtering. Collaborative filtering uses the behaviour of other users and their purchase history, in terms of ratings, selections and so on, to predict what a given user is likely to be interested in.
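
A heavily simplified sketch of the idea is to count how often items are bought together and recommend the most frequent companions. The purchase histories and the `also_bought` helper are hypothetical; this illustrates item-to-item co-occurrence only and is not a description of Amazon's actual system.

```python
from collections import Counter, defaultdict
from itertools import permutations

# Hypothetical purchase histories: one set of items per customer.
baskets = [
    {"laptop", "mouse", "laptop bag"},
    {"laptop", "mouse"},
    {"laptop", "keyboard"},
    {"mouse", "mouse pad"},
]

# Count how often each ordered pair of items appears in the same basket.
co_purchases = defaultdict(Counter)
for basket in baskets:
    for item, other in permutations(basket, 2):
        co_purchases[item][other] += 1


def also_bought(item, top_n=2):
    """Items most often bought together with `item`, most frequent first."""
    return [other for other, _ in co_purchases[item].most_common(top_n)]


top_for_laptop = also_bought("laptop", top_n=1)
```

In this toy data the mouse appears alongside the laptop in two baskets, more than any other item, so it is the first recommendation for laptop buyers.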

  1. Write the basic SQL query that lists all orders with customer information.

We have an Order table and a Customer table that contain the following columns:

  • The Order table consists of OrderId, CustomerId, OrderNumber and TotalAmount
  • The Customer table consists of Id, FirstName, LastName, City and Country

The SQL query is as follows (ORDER is a reserved word in SQL, so the table name is quoted):

SELECT OrderNumber, TotalAmount, FirstName, LastName, City, Country

FROM "Order"

JOIN Customer

ON "Order".CustomerId = Customer.Id
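
The join can be tried end to end with Python's built-in `sqlite3` module. This is a sketch under assumptions: the column types and the sample customer and order rows are made up, and the tables are assumed to be named Order and Customer with the columns listed above. Because ORDER is a reserved word in SQL, the table name is written in double quotes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Customer (
        Id INTEGER PRIMARY KEY, FirstName TEXT, LastName TEXT, City TEXT, Country TEXT
    );
    CREATE TABLE "Order" (
        OrderId INTEGER PRIMARY KEY, CustomerId INTEGER REFERENCES Customer(Id),
        OrderNumber TEXT, TotalAmount REAL
    );
    -- Hypothetical sample rows.
    INSERT INTO Customer VALUES (1, 'Ada', 'Lovelace', 'London', 'UK');
    INSERT INTO "Order" VALUES (10, 1, 'A-1001', 250.0);
""")

# List every order together with the matching customer's details.
rows = conn.execute("""
    SELECT OrderNumber, TotalAmount, FirstName, LastName, City, Country
    FROM "Order"
    JOIN Customer ON "Order".CustomerId = Customer.Id
""").fetchall()
conn.close()
```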

  1. You are given a dataset on cancer detection. You have built a classification model and achieved an accuracy of 96%. Why shouldn't you be happy with your model performance?

Cancer detection results in imbalanced data. In an imbalanced dataset, accuracy should not be used as the measure of performance. It is important to focus on the remaining 4%, which represents the patients who were wrongly diagnosed. Early diagnosis is crucial when it comes to cancer detection and can greatly improve a patient's prognosis.
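
A small worked example shows why accuracy misleads here. It assumes a hypothetical screening set in which 4 of 100 patients have cancer, and a useless model that predicts "no cancer" for everyone.

```python
# Hypothetical screening set: 4 of 100 patients actually have cancer (label 1).
actual = [1] * 4 + [0] * 96
# A useless model that predicts "no cancer" (label 0) for everyone.
predicted = [0] * 100

true_pos = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
false_neg = sum(a == 1 and p == 0 for a, p in zip(actual, predicted))
correct = sum(a == p for a, p in zip(actual, predicted))

accuracy = correct / len(actual)            # share of all predictions that are right
recall = true_pos / (true_pos + false_neg)  # share of real cancer cases that were caught
```

The model reaches 96% accuracy while catching none of the actual cancer cases, which is why measures such as recall (sensitivity) matter more than accuracy on imbalanced data.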

  1. Which machine learning algorithm can be used for imputing missing values of both categorical and continuous variables?

The k-nearest neighbours algorithm can be used, because it finds the records closest to the incomplete one based on the features that are present and fills in the missing value from those neighbours.

  1. We want to predict the probability of death from heart disease based on three risk factors: age, gender and blood cholesterol level. What is the most appropriate algorithm for this case?

Choose the correct option:

Logistic regression

Linear regression

K-means clustering

Apriori algorithm

The most appropriate algorithm for this case is logistic regression.

  1. After studying the behaviour of the population, you have identified four specific individual types that are valuable to your study. You would like to find all users who are most similar to each individual type. Which algorithm is most appropriate for this study?

K-means clustering

Linear regression

Association rules

Decision tree

The most appropriate algorithm is k-means clustering.

  1. You have run the association rules algorithm on your dataset and the two rules {banana, apple} => {grapes} and {apple, orange} => {grapes} have been found to be relevant. What else must be true?

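{apple, grapes} must be a frequent itemset. Each rule is derived from a frequent itemset ({banana, apple, grapes} and {apple, orange, grapes} respectively), and every subset of a frequent itemset is itself frequent. The two rules do not guarantee that {banana, apple, grapes, orange} is frequent.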

  1. Your organization has a website where visitors randomly receive one of two coupons. It is possible that visitors to the website will not receive a coupon. You have been asked to determine if offering a coupon to website visitors has any impact on their purchase decision. Which analysis method should you use?

The analysis method to use here is one-way ANOVA.





