I was being asked the question “What’s the difference between statistics and machine learning?” quite a lot lately, almost as often as this one “What’s the difference between a data analyst and a data scientist”, (which I might write about in another blog.) People wondering about the differences between these subjects probably see a lot of similarity between the two: They are both means to learn about the data, and they share many of the same methods.
The fundamental difference about the two is: statistics is focused on inference and conclusions while machine learning emphasizes on predictions and decisions.
Statisticians care deeply about the data collection process, methodology and statistical properties of the estimator. They are interested in learning something about the data. Statistics may support or reject hypothesis based on the noise of the data, validate models, or make forecasts, but overall the goal is to arrive at a new scientific insight based on the data. It other word, it wants to draw a valid and precise conclusion on problems proposed.
Machine Learning is about making a prediction, and algorithm is just a means to the end. The goal is to solve complex computational task by feeding data to a machine so it will tell us what the outcome will be. Instead of figuring out the cause and effect, we will collect a large amount of examples of what the mechanism should be, and then run an algorithm which is able to perform the task by learning from the examples. it builds model to predict a result, and use data to improve its prediction.
You may have realized that quite a few algorithms used in machine learning are statistical in nature, but as long as the prediction works well, any kind of statistical insight into the data is not necessary.
A paper Statistical Modeling: The Two Cultures published by Leo Breiman in the year 2001 explains the differences between statistics and machine learning very well. I’m going to post the abstract here:
There are two cultures in the use of statistical modeling to reach conclusions from data. One assumes that the data are generated by a given stochastic data model. The other uses algorithmic models and treats the data mechanism as unknown. The statistical community has been committed to the almost exclusive use of data models. This commitment has led to irrelevant theory, questionable conclusions, and has kept statisticians from working on a large range of interesting current problems. Algorithmic modeling, both in theory and practice, has developed rapidly in fields outside statistics. It can be used both on large complex data sets and as a more accurate and informative alternative to data modeling on smaller data sets. If our goal as a field is to use data to solve problems, then we need to move away from exclusive dependence on data models and adopt a more diverse set of tools.”
In this paper, two cultures are introduced and we can treat Data Modeling Culture as Statistics and Algorithmic Modeling Culture as Machine Learning. (The term Machine Learning still resides mostly in science fictions in 2001.) The following two pictures show clearly the difference between the two cultures.
Data modeling culture assumes a data model and estimates the values from the parameters using the data model.
Algorithmic modeling treats the true algorithm inside the box complex and unknown. It creates another algorithm that operates on x to predict y.
Now I wonder if there will be anyone who can’t wait for my blog about the other frequently asked question, and pop this one: “What about Data Science?”
Data Science employs all the techniques and theories drawn from many fields including mathematics, statistics, information science, computer science, which also includes machine learning, data mining, predictive analytics, etc. to extract knowledge or insights from data. Data scientist is not a new fancy title on name cards; he is a true master of data.