19 Data Science Interview Questions for Professionals

Data science is an interdisciplinary field that uses scientific methods, algorithms, processes and systems to extract meaningful information from large volumes of data and to support decisions based on that information. Here we list a few important data science questions that are asked of both aspiring data scientists and professionals with previous experience in the field. These are some of the advanced-level questions you can expect in interviews.

1. Explain Auto-Encoder

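Auto-encoders are neural networks trained to reproduce their input at the output with as little error as possible. The input is first compressed into a smaller internal representation and then reconstructed from it, so the network has to learn the most informative features of the data.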

2. Define Boltzmann Machine

A Boltzmann machine is a network of symmetrically connected, stochastic units with a simple learning algorithm that lets it discover complex regularities in the training data.

3. When does underfitting occur in a statistical model?

Underfitting occurs when a statistical model or machine learning algorithm is too simple to capture the underlying trend of the data, so it performs poorly on both the training data and new data.

4. What are the most commonly used algorithms by Data Scientists?

Four of the algorithms data scientists use most often are:

  • Logistic regression
  • Linear regression
  • Random Forest
  • KNN

5. What is recall?

Recall is the ratio of true positives to all actual positives, that is, TP / (TP + FN). It is also called the true positive rate, and it ranges from 0 to 1.
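As a minimal sketch of the definition, the snippet below counts true positives (TP) and false negatives (FN) by hand. The function name `recall`, the `positive` parameter and the sample labels are illustrative assumptions, not part of any particular library.

```python
def recall(y_true, y_pred, positive=1):
    """Recall = TP / (TP + FN) for the chosen positive label."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp + fn == 0:
        # No actual positives: recall is undefined; 0.0 is a convention chosen here.
        return 0.0
    return tp / (tp + fn)

# Made-up labels: 5 actual positives, 3 of which the model finds.
y_true = [1, 1, 1, 1, 0, 0, 0, 1]
y_pred = [1, 0, 1, 1, 0, 1, 0, 0]
print(recall(y_true, y_pred))  # 0.6
```

Note that the false positive at the sixth position does not affect recall; it would lower precision instead.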

6. How will you capture the correlation between a continuous and a categorical variable?

The analysis of covariance (ANCOVA) technique can be used to capture the association between continuous and categorical variables.

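7. What is a p-value?

The p-value is the probability, assuming the null hypothesis is true, of observing a result at least as extreme as the one actually observed. It ranges from 0 to 1 and helps you judge the strength of the evidence in a hypothesis test: the smaller the p-value, the stronger the evidence against the null hypothesis.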
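To make this concrete, here is a small sketch of an exact one-sided binomial test, assuming a hypothetical experiment in which a coin lands heads 9 times in 10 flips and the null hypothesis is that the coin is fair. The function name `binomial_p_value` is an illustrative choice.

```python
from math import comb

def binomial_p_value(n, k, p0=0.5):
    """One-sided p-value: P(X >= k) for X ~ Binomial(n, p0)."""
    return sum(comb(n, i) * p0**i * (1 - p0)**(n - i) for i in range(k, n + 1))

# Probability of 9 or more heads in 10 flips of a fair coin: (10 + 1) / 1024.
p = binomial_p_value(n=10, k=9)
print(round(p, 4))  # 0.0107
```

A value this small would usually be read as evidence against the fair-coin hypothesis at the common 0.05 significance level.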

8. What are Recommender Systems?

Recommender systems are a subclass of information filtering systems that predict the rating or preference an end-user is likely to give a product or service. They are widely used for movies, news, music, etc.

9. What is bias? 

Bias is the error introduced when a machine learning algorithm makes overly simple assumptions about the data. High bias is the main cause of underfitting.

10. What is Naive Bayes algorithm used for? 

The Naive Bayes algorithm, which is based on Bayes' theorem, is used to estimate the probability that an event occurs (for example, that a data point belongs to a class) given the observed features.

Naive Bayes works well in practical text mining applications. However, it relies on an assumption that rarely holds in real-world data: it computes the conditional probability as the plain product of the individual probabilities of the features. This implies complete independence among the features, which is practically impossible or very difficult to achieve.
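The product-of-probabilities idea can be sketched with a tiny text classifier. This is an illustrative toy, not a production implementation: the `train` and `predict` helpers, the add-one (Laplace) smoothing and the four sample documents are all assumptions made for the example. Logarithms are summed instead of multiplying raw probabilities to avoid numerical underflow.

```python
from collections import Counter
from math import log

def train(docs):
    """docs: list of (label, list_of_words). Returns the counts needed for prediction."""
    class_counts = Counter(label for label, _ in docs)
    word_counts = {label: Counter() for label in class_counts}
    for label, words in docs:
        word_counts[label].update(words)
    vocab = {w for _, words in docs for w in words}
    return class_counts, word_counts, vocab

def predict(words, class_counts, word_counts, vocab):
    """Pick the class maximizing log P(class) + sum of log P(word | class)."""
    total_docs = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        score = log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for w in words:
            # Add-one smoothing so an unseen word does not zero out the product.
            score += log((word_counts[label][w] + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

# Made-up training data for illustration only.
docs = [
    ("spam", ["win", "money", "now"]),
    ("spam", ["free", "money"]),
    ("ham", ["meeting", "at", "noon"]),
    ("ham", ["lunch", "at", "noon"]),
]
model = train(docs)
print(predict(["free", "money", "now"], *model))     # spam
print(predict(["meeting", "at", "lunch"], *model))   # ham
```

Each word contributes its own factor regardless of the other words in the message, which is exactly the independence assumption described above.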

11. How does an ROC curve work?

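The ROC curve graphically represents the trade-off between the true positive rate and the false positive rate at various classification thresholds. The true positive rate is plotted against the false positive rate, and a curve closer to the top-left corner indicates a better classifier.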
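The following sketch shows how the points of an ROC curve could be computed by sweeping a threshold over predicted scores. The helper names `roc_points` and `auc`, and the sample labels and scores, are assumptions for illustration; the code also assumes both classes are present in `y_true`.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs, one per threshold, from strictest to loosest."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 0)
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Made-up labels (1 = positive) and classifier scores.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
points = roc_points(y_true, scores)
print(points[-1])             # (1.0, 1.0)
print(round(auc(points), 4))  # 0.8889
```

The area under the curve summarizes the whole plot in one number: 1.0 for a perfect ranking and about 0.5 for random guessing.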

12. Discuss Decision Tree algorithm 

A decision tree is a versatile supervised machine learning algorithm, primarily employed for Regression and Classification. It works by breaking down a dataset into smaller subsets. The decision tree comes equipped with the ability to handle both categorical and numerical data. 

13. What is Random Forest? How does it work?

Random forest is a popular machine learning method that can perform both regression and classification tasks. It is a type of ensemble learning method, in which a variety of weak models combine to form a powerful model.

In Random Forest, as the name suggests, we grow multiple trees instead of a single tree, and each tree offers its own prediction. For classification, the forest picks the class with the maximum number of votes; for regression, it takes the average of the outputs of the different trees.
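The aggregation step alone can be sketched in a few lines. The per-tree predictions below are made-up values standing in for the outputs of already-trained trees, and the function names `forest_classify` and `forest_regress` are hypothetical; growing the trees themselves is not shown.

```python
from collections import Counter
from statistics import mean

def forest_classify(tree_votes):
    """Majority vote over the class predicted by each tree."""
    return Counter(tree_votes).most_common(1)[0][0]

def forest_regress(tree_outputs):
    """Average of the numeric predictions of each tree."""
    return mean(tree_outputs)

print(forest_classify(["cat", "dog", "cat", "cat", "dog"]))  # cat
print(forest_regress([10.0, 12.0, 11.0, 13.0]))              # 11.5
```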

14. Between Python and R, which do you think is better suited for text analytics, and why?

For text analytics, the preferred option would be Python for the following reasons:

  • The Pandas library offers easy-to-use data structures in addition to high-performance data analysis tools
  • Python generally performs faster for most types of text analytics
  • R is better suited to machine learning than to text analysis

15. What are the disadvantages of using a linear model?

  • It assumes a linear relationship between the predictors and the outcome, which often does not hold
  • It is unsuitable for binary or count outcomes
  • There are many overfitting problems that it cannot solve

16. Which cross-validation technique would you use when dealing with a time series data set?

Forward chaining should be used for time series data. You fit the model on past data and then test it on the data that follows, so the model never sees the future during training.

fold 1: training[1], test[2]

fold 2: training[1 2], test[3]

fold 3: training[1 2 3], test[4]

fold 4: training[1 2 3 4], test[5]
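The folds listed above can be generated with a short helper. This is a minimal sketch assuming the series has already been cut into consecutive, numbered blocks; the function name `forward_chaining_splits` is an illustrative choice.

```python
def forward_chaining_splits(n_blocks):
    """Yield (train, test) block numbers for expanding-window validation."""
    for i in range(1, n_blocks):
        # Train on every block up to i, test on the block that follows.
        yield list(range(1, i + 1)), [i + 1]

for fold, (train, test) in enumerate(forward_chaining_splits(5), start=1):
    print(f"fold {fold}: training{train}, test{test}")
```

Unlike ordinary k-fold cross-validation, the test block always lies after every training block, which preserves the time ordering.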

17. How does back-propagation work? Also, state its variants.

Back-propagation is the essence of neural net training. It is the training algorithm used for multi-layer neural networks and an efficient method of computing the gradients of the loss function with respect to the network parameters. In other words, backprop is about computing gradients for nested functions, represented as a computational graph, using the chain rule.

There are three main variants, distinguished by how much data is used for each weight update: stochastic (also called online), batch and mini-batch.
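As a rough illustration of the chain rule at work, the sketch below back-propagates through a deliberately tiny network with one hidden unit and compares the result with a finite-difference estimate. The network shape, the `tanh` activation, the squared-error loss and all numeric values are assumptions chosen for the example.

```python
from math import tanh

def forward(w1, w2, x, target):
    """Nested function: loss(y(h(w1))) for a one-hidden-unit network."""
    h = tanh(w1 * x)            # hidden activation
    y = w2 * h                  # network output
    loss = (y - target) ** 2    # squared error
    return h, y, loss

def backward(w1, w2, x, target):
    """Gradients of the loss with respect to w1 and w2 via the chain rule."""
    h, y, _ = forward(w1, w2, x, target)
    dloss_dy = 2 * (y - target)               # d loss / d y
    dloss_dw2 = dloss_dy * h                  # y = w2 * h
    dloss_dh = dloss_dy * w2                  # propagate back to the hidden unit
    dloss_dw1 = dloss_dh * (1 - h ** 2) * x   # derivative of tanh is 1 - tanh^2
    return dloss_dw1, dloss_dw2

def numeric_gradients(w1, w2, x, target, eps=1e-6):
    """Central finite differences, used only to check the analytic gradients."""
    loss = lambda a, b: forward(a, b, x, target)[2]
    g1 = (loss(w1 + eps, w2) - loss(w1 - eps, w2)) / (2 * eps)
    g2 = (loss(w1, w2 + eps) - loss(w1, w2 - eps)) / (2 * eps)
    return g1, g2

analytic = backward(0.5, -1.2, 2.0, 1.0)
numeric = numeric_gradients(0.5, -1.2, 2.0, 1.0)
print(analytic)
print(numeric)
```

The two printed pairs should agree to several decimal places. In the stochastic variant such gradients are applied after every single example, in the batch variant after averaging over the whole training set, and in the mini-batch variant after averaging over a small subset.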


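18. How will you define the number of clusters in a clustering algorithm?

The primary purpose of clustering is to group similar entities together so that the entities within a group are similar to one another while the groups themselves remain dissimilar. A common way to choose the number of clusters is the elbow method: plot the within-cluster sum of squares against the number of clusters and pick the point beyond which adding more clusters no longer reduces it appreciably.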

19. What is TF/IDF vectorization?

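tf–idf stands for term frequency–inverse document frequency. It is a numerical statistic that reflects how important a word is to a document within a collection or corpus. The value increases with the number of times the word appears in the document and is offset by the number of documents in the corpus that contain the word.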
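A bare-bones version of the calculation might look as follows. This sketch assumes one common, unsmoothed formulation (term frequency as a share of the document length, inverse document frequency as the natural log of N over the document count); libraries often use smoothed or normalized variants, so their numbers will differ. The function name `tf_idf` and the sample documents are illustrative, and every document is assumed to be non-empty.

```python
from collections import Counter
from math import log

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Number of documents that contain each term at least once.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "barked"],
    ["the", "cat", "purred"],
]
weights = tf_idf(docs)
print(weights[0])
```

In this toy corpus "the" appears in every document and gets a weight of zero, while "sat", which appears in only one document, outweighs "cat", which appears in two.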


Go through these carefully and look for additional information from different sources as well to ace your interview and bag your dream job.
