Statistical Modeling: The Two Cultures by Leo Breiman

In his paper, Leo Breiman discusses the dichotomy between two dominant approaches to statistical analysis, whose goals are to predict future values of the response from independent input variables and to extract information about how the response variables are associated with those inputs.

The first approach, Data Modeling, assumes the data are generated by a predefined stochastic model, so most statisticians only need to estimate that model's parameters in order to predict. The other approach, Algorithmic Modeling, which was rarely used by statisticians at the time, treats the data mechanism as unknown and fits complex functions that map the inputs to the outputs. Examples of the data modeling approach are linear regression, logistic regression, and Cox models; examples of algorithmic modeling are decision trees and neural nets.
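To make the contrast concrete, here is a minimal sketch of my own (not from Breiman's paper), using scikit-learn on synthetic data: the data-modeling view fits an assumed linear model and reads off its coefficients, while the algorithmic view fits a flexible function and judges it by prediction error.

```python
# Contrasting the two cultures on the same synthetic data (illustrative only).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(500, 3))                          # independent input variables
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + rng.normal(0, 0.1, 500)   # unknown mechanism plus noise

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Data modeling culture: assume a stochastic model (here, linear) and
# interpret its estimated parameters.
linear = LinearRegression().fit(X_train, y_train)
print("linear coefficients:", linear.coef_)
print("linear test MSE:", mean_squared_error(y_test, linear.predict(X_test)))

# Algorithmic modeling culture: treat the mechanism as a black box and
# judge the fitted function mainly by its predictive accuracy.
tree = DecisionTreeRegressor(max_depth=5, random_state=0).fit(X_train, y_train)
print("tree test MSE:", mean_squared_error(y_test, tree.predict(X_test)))
```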

Using the Ozone Project as an example, he argues that the Data Modeling approach prioritizes interpretability, a predefined model structure, and hypothesis testing over prediction accuracy, and that it places a strong emphasis on the theoretical foundations of statistics and the development of efficient estimators. He suggests that in many real-world applications prediction performance is of paramount importance, and the Data Modeling approach may fall short in this regard.

In contrast, the Algorithmic Modeling culture, according to Breiman, focuses on developing algorithms and techniques that can automatically learn patterns and make accurate predictions from data. This culture is primarily concerned with predictive accuracy rather than formal inference.

Moreover, drawing an analogy with the film Rashomon, he describes how many different models can generate essentially the same predictions, with errors within a small threshold of one another. Instability occurs when many different models are crowded together with about the same training or test set error; a slight perturbation of the data or of the model construction then causes a skip from one model to another. The two models are close to each other in terms of error but can be distant in terms of the form of the model.
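The following sketch (my illustration, not Breiman's) shows this instability: bootstrap resamples of the same training data yield trees with nearly identical test error but different structure, such as different features chosen at the root split.

```python
# Rashomon effect / instability: small data perturbations give models that are
# close in error but different in form (illustrative example).
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 4))
y = X[:, 0] + X[:, 1] + 0.5 * rng.normal(size=400)     # two equally useful features

X_test = rng.normal(size=(200, 4))
y_test = X_test[:, 0] + X_test[:, 1]

for i in range(3):
    idx = rng.integers(0, len(X), len(X))              # perturbation: bootstrap resample
    tree = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X[idx], y[idx])
    root_feature = tree.tree_.feature[0]               # which feature the root node splits on
    mse = mean_squared_error(y_test, tree.predict(X_test))
    print(f"resample {i}: root splits on x{root_feature}, test MSE = {mse:.3f}")
```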

He introduces the random forest algorithm as an example of a powerful ensemble method that combines many decision trees to improve prediction accuracy. He also reports prediction accuracy on several data sets for a single-tree algorithm and for random forests; across these data sets, the random forest was consistently more accurate than the single tree.
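Here is a rough sketch of that kind of comparison using scikit-learn's bundled breast cancer data set; it illustrates the single-tree versus forest contrast but does not reproduce the specific figures in Breiman's paper.

```python
# Single decision tree vs. random forest, compared by cross-validated accuracy.
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

single_tree = DecisionTreeClassifier(random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0)

tree_acc = cross_val_score(single_tree, X, y, cv=10).mean()
forest_acc = cross_val_score(forest, X, y, cv=10).mean()

print(f"single tree 10-fold CV accuracy:   {tree_acc:.3f}")
print(f"random forest 10-fold CV accuracy: {forest_acc:.3f}")
```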

In conclusion, his point is that if the primary objective is to use data to solve problems, then statisticians need to move away from exclusive dependence on data models and adopt a more diverse set of approaches, including algorithmic ones.