Breiman’s Statistical Modelling
Leo Breiman, “Statistical Modeling: The Two Cultures”, Statistical Science, 2001, Vol. 16, No. 3, 199–231.
There are two cultures in the use of statistical modeling to reach conclusions from data. One assumes that the data are generated by a given stochastic data model. The other uses algorithmic models and treats the data mechanism as unknown. The statistical community has been committed to the almost exclusive use of data models. This commitment has led to irrelevant theory, questionable conclusions, and has kept statisticians from working on a large range of interesting current problems. Algorithmic modeling, both in theory and practice, has developed rapidly in fields outside statistics. It can be used both on large complex data sets and as a more accurate and informative alternative to data modeling on smaller data sets. If our goal as a field is to use data to solve problems, then we need to move away from exclusive dependence on data models and adopt a more diverse set of tools.
3. PROJECTS IN CONSULTING
As a consultant … I worked on a diverse set of prediction projects. Here are some examples:
- Predicting next-day ozone levels.
- Using mass spectra to identify halogen-containing compounds.
- Predicting the class of a ship from high altitude radar returns.
- Using sonar returns to predict the class of a submarine.
- Identity of hand-sent Morse Code.
- Toxicity of chemicals.
- On-line prediction of the cause of a freeway traffic breakdown.
- Speech recognition.
- The sources of delay in criminal trials in state court systems.
7 Algorithmic Modelling
7.1 A New Research Community
In the mid-1980s two powerful new algorithms for fitting data became available: neural nets and decision trees. A new research community using these tools sprang up. Their goal was predictive accuracy. … They began using the new tools in working on complex prediction problems where it was obvious that data models were not applicable:
- speech recognition,
- image recognition,
- time series prediction,
- handwriting recognition,
- prediction in financial markets.
7.2 Theory in Algorithmic Modeling
… What is observed is a set of x’s that go in and a subsequent set of y’s that come out. The problem is to find an algorithm f(x) such that for future x in a test set, f(x) will be a good predictor of y. The theory in this field shifts focus from data models to the properties of algorithms. It characterizes their “strength” as predictors, convergence if they are iterative, and what gives them good predictive accuracy. The one assumption made in the theory is that the data is drawn i.i.d. from an unknown multivariate distribution.
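The setup above can be sketched concretely. In this toy example the data-generating rule, the 1-nearest-neighbour choice of f, and all of the numbers are illustrative assumptions, not anything from the paper — the point is only that f is judged purely by its test-set predictions, never by its resemblance to the mechanism:

```python
import random
random.seed(0)

def nature(x):
    # The unknown mechanism: the analyst never sees this rule.
    return 1 if x[0] + x[1] > 0 else 0

train = [[random.uniform(-1, 1), random.uniform(-1, 1)] for _ in range(200)]
test = [[random.uniform(-1, 1), random.uniform(-1, 1)] for _ in range(50)]
y_train = [nature(x) for x in train]
y_test = [nature(x) for x in test]

def f(x):
    # A purely algorithmic predictor (1-nearest neighbour): no data model at all.
    nearest = min(range(len(train)),
                  key=lambda i: sum((a - b) ** 2 for a, b in zip(train[i], x)))
    return y_train[nearest]

# The only criterion: how well f predicts y for future x in a test set.
accuracy = sum(f(x) == t for x, t in zip(test, y_test)) / len(test)
print(accuracy)
```

The i.i.d. assumption enters implicitly: the test points are drawn from the same distribution as the training points, which is what makes test accuracy a meaningful estimate.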
7.3 Recent Lessons
The advances in methodology and increases in predictive accuracy since the mid-1980s in machine-learning research have been phenomenal. There have been particularly exciting developments in the last five years. What has been learned? The three lessons that seem most important to me:
- Rashomon: the multiplicity of good models;
- Occam: the conflict between simplicity and accuracy;
- Bellman: dimensionality—curse or blessing.
Breiman advocates using support vector machines (SVMs) as the algorithms, and trying to establish the spread of such algorithms that predict the data about equally well.
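The Rashomon lesson — many different models, about equally accurate — can be shown with a deliberately crude sketch. The hidden rule, the two one-variable “stump” models, and the numbers are all invented for illustration; neither model resembles the true mechanism, yet they predict about equally well:

```python
import random
random.seed(2)

def nature(x):
    # Hidden mechanism (unknown to both models): depends on both variables.
    return 1 if x[0] + x[1] > 0 else 0

test = [[random.uniform(-1, 1), random.uniform(-1, 1)] for _ in range(2000)]
truth = [nature(p) for p in test]

# Two structurally different models, each telling a different "story":
def model_a(p):
    return 1 if p[0] > 0 else 0  # looks only at the first variable

def model_b(p):
    return 1 if p[1] > 0 else 0  # looks only at the second variable

acc_a = sum(model_a(p) == t for p, t in zip(test, truth)) / len(test)
acc_b = sum(model_b(p) == t for p, t in zip(test, truth)) / len(test)
print(acc_a, acc_b)  # near-equal accuracy, very different "explanations"
```

If a data modeller picked either one and interpreted it as the mechanism, the choice between two equally good but contradictory stories would be arbitrary — which is Breiman’s point about the multiplicity of good models.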
12. Final Remarks
… The best solution could be an algorithmic model, or maybe a data model, or maybe a combination. But the trick to being a scientist is to be open to using a wide variety of tools.
… One of our failings has, I believe, been, in a wish to stress generality, not to set out more clearly the distinctions between different kinds of application and the consequences for the strategy of statistical analysis.
… [There] are situations where a directly empirical approach is better. Short term economic forecasting and real-time flood forecasting are probably further exemplars. Key issues are then the stability of the predictor as practical prediction proceeds, the need from time to time for recalibration and so on.
However, much prediction is not like this. Often the prediction is under quite different conditions from the data … .
Professor Breiman takes a rather defeatist attitude toward attempts to formulate underlying processes; is this not to reject the base of much scientific progress? … Better a rough answer to the right question than an exact answer to the wrong question, an aphorism, due perhaps to Lord Kelvin, that I heard as an undergraduate in applied mathematics.
I have stayed away from the detail of the paper but will comment on just one point, the interesting theorem of Vapnik about complete separation. This confirms folklore experience with empirical logistic regression that, with a largish number of explanatory variables, complete separation is quite likely to occur. It is interesting that in mainstream thinking this is, I think, regarded as insecure in that complete separation is thought to be a priori unlikely and the estimated separating plane unstable. Presumably bootstrap and cross-validation ideas may give here a quite misleading illusion of stability. Of course, if the complete separator is subtle and stable Professor Breiman’s methods will emerge triumphant, and ultimately it is an empirical question in each application as to what happens. It will be clear that while I disagree with the main thrust of Professor Breiman’s paper I found it stimulating and interesting.
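The separation phenomenon Cox refers to is easy to demonstrate numerically. This is not Vapnik’s theorem itself, only an invented illustration: with far fewer observations than explanatory variables, points in general position can be completely separated under *any* labelling — even pure noise. A perceptron is used here simply as a separability check, since it converges to zero errors exactly when the data are linearly separable:

```python
import random
random.seed(0)

# n observations, d explanatory variables, with n well below d.
n, d = 10, 50
X = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
y = [random.choice([-1, 1]) for _ in range(n)]  # labels are pure noise

w, b = [0.0] * d, 0.0
for _ in range(500):  # perceptron passes; converges iff data are separable
    mistakes = 0
    for xi, yi in zip(X, y):
        if yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b) <= 0:
            w = [wj + yi * xj for wj, xj in zip(w, xi)]
            b += yi
            mistakes += 1
    if mistakes == 0:
        break

print(mistakes)  # 0: even random labels are completely separated
```

This is exactly why complete separation in empirical logistic regression with many variables can be “insecure”: the separating plane found here fits noise perfectly and says nothing about future data.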
At first glance Leo Breiman’s stimulating paper looks like an argument against parsimony and scientific insight, and in favor of black boxes with lots of knobs to twiddle.
Bruce describes algorithmic techniques similar to SVMs, as developed for credit scoring in the 1960s. The approach was ‘a combination approach’ in the sense that the algorithms were initiated and tweaked to make sense legally and commercially. He gives advice on building data models to do the same thing:
[Build] accurate models for which no variable is much more important than other variables. There is always a chance that a variable and its relationships will change in the future. After that, you still want the model to work. So don’t make any variable dominant.
… I readily acknowledge that there are situations where a simple data model may be useful and appropriate; for instance, if the science of the mechanism producing the data is well enough known to determine the model apart from estimating parameters. … Simple models can [also] be useful in giving qualitative understanding, suggesting future research areas and the kind of additional data that needs to be gathered.
At times, there is not enough data on which to base predictions; but policy decisions need to be made. In this case, constructing a model using whatever data exists, combined with scientific common sense and subject-matter knowledge, is a reasonable path.
… [Short-term] economic forecasts and real-time flood forecasts [are] among the less interesting of all of the many current successful algorithmic applications. In [Cox’s] view, the only use for algorithmic models is short-term forecasting;
I agree with the author that the scope of statistics needs to increase to consider the problems and methods that he raises, but my own experience is more like Hoadley’s. In the 1980s I became involved with an unattended road-side system used to count and crudely classify passing vehicles. A more sophisticated sensor was being developed. Human operators could provide much greater discrimination between vehicle types, but the algorithm developed to automate the process yielded poor performance. The method being used had been developed somewhat ad hoc, as a combination of SVM and data modelling. As a first step I tried to understand the algorithm from a scientific perspective, looking at the physics, the sensor design and the statistics. I soon found that the manually programmed element of the algorithm had got some physics wrong; correcting this made for acceptable performance. But I also found that some of the design features that had helped the human operators were a hindrance to the algorithm. This led to a redesign, and further improvements. Now the algorithm was a ‘black box’ in so far as it used a large feature set with many parameters that were just meaningless numbers. But by understanding the general principles it was nonetheless possible to understand enough to improve the whole sensor-algorithm system, developing appropriate diagnostic techniques. Perhaps it was a ‘grey box’?
The key take-away here is that what is needed is not just a tool, method or algorithm but the kind of understanding that is provided by an appropriate theory.
On some minor points:
- Much of the discussion is really about what makes for good science.
- The paper’s interpretation of Occam seems a little odd. If Occam is about making as few assumptions as possible, then the paper’s ‘forest’ approach seems sensible. I would go further and try to characterise what all the solutions have in common, and what the key differentiators are.
- Following the UK floods of 2007 and the global economic crisis of 2007/8, the remark that algorithmic forecasting was ‘successful’ and even ‘less interesting’ seems odd. Of course, if prediction is the aim, we can just repeatedly predict that there will be no crisis and we will be correct almost all the time.
My own view is that where we want a prediction we often should also want an estimate of that prediction’s reliability. That is, we should know when our data models or algorithms are particularly suspect.
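This closing point — report a prediction together with an estimate of its reliability — can be illustrated with a minimal bootstrap sketch. The data and the sample-mean predictor are invented for illustration; the idea is only that refitting on resampled data reveals how much the prediction itself would wobble:

```python
import random
random.seed(1)

# Invented observations; the "predictor" is simply the sample mean.
data = [9.8, 10.1, 10.4, 9.6, 10.2, 9.9, 10.0, 10.3]
point_prediction = sum(data) / len(data)

# Bootstrap: refit the predictor on resampled data to gauge its stability.
boot = []
for _ in range(2000):
    sample = [random.choice(data) for _ in data]
    boot.append(sum(sample) / len(sample))
mean_b = sum(boot) / len(boot)
se = (sum((v - mean_b) ** 2 for v in boot) / (len(boot) - 1)) ** 0.5

print(point_prediction, se)  # prediction plus a measure of its reliability
```

When the bootstrap spread is large — or, as in the complete-separation case above, misleadingly small — that is precisely the signal that the model or algorithm is “particularly suspect”.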