As improbable as this may seem now, I was at one time in college a statistics major. After taking all the undergraduate courses in statistics, I enrolled in a graduate course in mathematical statistics at Columbia with the eminent Harold Hotelling, one of the founders of modern mathematical economics. After listening to several lectures of Hotelling, I experienced an epiphany: the sudden realization that the entire “science” of statistical inference rests on one crucial assumption, and that that assumption is utterly groundless. I walked out of the Hotelling course, and out of the world of statistics, never to return.

Statistics, of course, is far more than the mere collection of data. Statistical inference is the conclusions one can draw from that data. In particular, since—apart from the decennial U.S. census of population—we never know all the data, our conclusions must rest on very small samples drawn from the population. After taking our sample or samples, we have to find a way to make statements about the population as a whole. For example, suppose we wish to conclude something about the average height of the American male population. Since there is no way that we can mobilize every male American and measure everyone’s height, we take samples of a small number, say 500 people, selected in various ways, from which we presume to say what the average American’s height may be.

In the science of statistics, the way we move from our known samples to the unknown population is to make one crucial assumption: that the samples will, in any and all cases, whether we are dealing with height or unemployment or who is going to vote for this or that candidate, be distributed around the population figure according to the so-called “normal curve.”

The normal curve is the symmetrical, bell-shaped curve familiar from every statistics textbook. Because all samples are assumed to fall around the population figure according to this curve, the statistician feels justified in asserting, from his one or more limited samples, that the height of the American population, or the unemployment rate, or whatever, is definitely XYZ within a “confidence level” of 90 or 95 percent. In short, if, for example, a sample height for the average male is 5 feet 9 inches, 90 or 95 out of every 100 such samples will be within a certain definite range of 5 feet 9 inches. These precise figures are arrived at simply by assuming that all samples are distributed around the population according to this normal curve.
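The mechanics being described can be sketched in a few lines of Python. This is a minimal illustration, not the author's own calculation: the sample here is made-up data, and the critical value 1.96 is precisely the normal-curve assumption the essay goes on to question.

```python
import math
import random

random.seed(0)

# Hypothetical sample of 500 simulated heights in inches
# (illustrative data only; no real survey is being reproduced).
sample = [random.gauss(69, 3) for _ in range(500)]

n = len(sample)
mean = sum(sample) / n
# Sample standard deviation, with Bessel's correction.
sd = math.sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))

# Standard error of the mean, and a 95 percent interval built
# from the normal-curve critical value z = 1.96 -- this is the
# step that assumes samples scatter along a bell curve.
se = sd / math.sqrt(n)
low, high = mean - 1.96 * se, mean + 1.96 * se
print(f"mean = {mean:.2f} in, 95% CI = ({low:.2f}, {high:.2f})")
```

The "precise" confidence interval falls straight out of the assumed curve: change the distributional assumption and the interval changes with it.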

It is because of the properties of the normal curve, for example, that the election pollsters could assert, with overwhelming confidence, that Bush was favored by a certain percentage of voters, and Dukakis by another percentage, all within “three percentage points” or “five percentage points” of “error.” It is the normal curve that permits statisticians not to claim absolute knowledge of all population figures precisely but instead to claim such knowledge within a few percentage points.

Well, what is the evidence for this vital assumption of distribution around a normal curve? None whatever. It is a purely mystical act of faith. In my old statistics text, the only “evidence” for the universal truth of the normal curve was the statement that if good riflemen shoot to hit a bullseye, the shots will tend to be distributed around the target in something like a normal curve. On this incredibly flimsy basis rests an assumption vital to the validity of all statistical inference.

Unfortunately, the social sciences tend to follow the same law that the late Dr. Robert Mendelsohn has shown is adopted in medicine: never drop any procedure, no matter how faulty, until a better one is offered in its place. And now it seems that the entire fallacious structure of inference built on the normal curve has been rendered obsolete by high-tech.

Ten years ago, Stanford statistician Bradley Efron used high-speed computers to generate “artificial data sets” based on an original sample, and to make the millions of numerical calculations necessary to arrive at a population estimate without using the normal curve, or any other arbitrary mathematical assumption of how samples are distributed about the unknown population figure. After a decade of discussion and tinkering, statisticians have agreed on methods of practical use of this “bootstrap” method, and it is now beginning to take over the profession. Stanford statistician Jerome H. Friedman, one of the pioneers of the new method, calls it “the most important new idea in statistics in the last 20 years, and probably the last 50.”
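The bootstrap idea described above can be sketched very briefly: resample the observed sample with replacement many times, and read the spread of the statistic directly off the resamples, with no bell curve assumed anywhere. A minimal sketch, using made-up data and the simple percentile version of the method:

```python
import random

random.seed(1)

# Hypothetical observed sample of heights in inches (illustrative only).
sample = [64, 66, 67, 68, 68, 69, 70, 71, 72, 75]

def bootstrap_means(data, n_resamples=10_000):
    """Draw resamples of the data with replacement; record each resample's mean."""
    n = len(data)
    return [
        sum(random.choice(data) for _ in range(n)) / n
        for _ in range(n_resamples)
    ]

means = sorted(bootstrap_means(sample))

# A 90 percent interval read straight from the empirical
# distribution of resampled means -- whatever shape that
# distribution happens to take.
low = means[int(0.05 * len(means))]
high = means[int(0.95 * len(means))]
print(f"90% bootstrap interval for the mean: ({low:.2f}, {high:.2f})")
```

The computer does the millions of arithmetic operations that make the shortcut of an assumed curve unnecessary, which is why the method had to wait for cheap computing power.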

At this point, statisticians are finally willing to let the cat out of the bag. Friedman now concedes that “data don’t always follow bell-shaped curves, and when they don’t, you make a mistake” with the standard methods. In fact, he added that “the data frequently are distributed quite differently than in bell-shaped curves.” So that’s it; now we find that the normal curve Emperor has no clothes after all. The old mystical faith can now be abandoned; the Normal Curve god is dead at long last.

*The above appears as a chapter in Making Economic Sense.*
