An Accelerated Master's Degree: Statistics in a Book (or Two)

There have been plenty of books recently published on the concept of Deliberate Practice, which essentially says that it takes 10,000 hours of a certain kind of practice (called ‘deliberate’) to gain expertise in something.   It makes sense that the majority of what we want to learn in any discipline is going to be experiential (or gained through practice).  But in order to better understand our experiences, we want to have some kind of framework of what to expect.  We want to develop a theory structure.

Books are what give us this theory structure, and certainly the quality of the theory structure we begin with impacts the amount of deliberate practice we need to become an ‘expert.’  So it’s important to choose the right books, as they will provide the base infrastructure upon which we will layer our experiences.  We’re looking for books that concisely capture the overriding concepts of a particular discipline.

And in any reasonably well-defined discipline, it doesn’t take many books to accomplish this.  I would generally say that three books or fewer, for each discipline, will give you a proper theory structure.

With that in mind, let’s look at Statistics… 

Statistics is the study of how to collect, organize, analyze, and interpret numerical information, or data.  The most basic task when working with data is to summarize a great deal of information.  And in our attempts to summarize information, we infer things.  In other words, statistics is not out to prove anything with certainty.  It is, however, out to increase our understanding of something, and it usually does that through Statistical Inference.  We make statistical inferences by observing some pattern of outcomes and then using Probability (the study of events and outcomes involving an element of uncertainty) to determine the most likely explanation for that pattern.

Because statistical analysis rarely unveils “the truth,” we are usually building a circumstantial case (based on imperfect data).  We have to be okay with this messiness.  After all, we live in a world of approximations and Incompleteness, as noted by Gödel.  So although the field of statistics is rooted in mathematics, and mathematics is exact, the use of statistics to describe phenomena is not exact.

Statistics as a subject is broken down into two branches: Descriptive Statistics and Inferential Statistics.  It’s actually pretty straightforward.  Descriptive statistics involves organizing, picturing, and summarizing data.  Inferential statistics involves using information from a sample to draw conclusions about the population.  Both branches heavily involve the use of variables to explain the data, and two important types of variable to differentiate are qualitative and quantitative.

Qualitative aspects of statistics invoke what is referred to as Natural Inference.  Natural inference begins with an intention to explore a particular area, collects “data” (i.e. observations and interviews), and generates ideas and hypotheses from these data largely through what is known as Inductive reasoning.  The strength of qualitative research lies in its validity (or closeness to the truth).

Quantitative aspects of statistics are generally referred to as Statistical Inference (mentioned above).  In Statistical Inference, we should begin with an idea (usually called a hypothesis) which, through measurement, generates data allowing us to make a conclusion (in other words, we should use the Scientific Method).  The strength of the quantitative approach lies in its reliability (or repeatability).

In our attempts to describe data, we usually start with those descriptions that are easiest to understand.  This means we start with what are called measures of center (or central tendency).  In order for someone to get an initial understanding of the data, these are usually the first descriptions offered.  They include mean, median, and mode.  For a more exhaustive explanation of these measures, it’s best to turn to Sal Khan.
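To make these concrete, here is a minimal sketch in Python using only the standard library; the commute-time numbers are made up purely for illustration.

```python
# Measures of center for a small, hypothetical dataset of commute times (minutes).
import statistics

commute_times = [22, 25, 25, 30, 31, 34, 55]

print(statistics.mean(commute_times))    # arithmetic average, pulled upward by the 55
print(statistics.median(commute_times))  # middle value of the sorted data
print(statistics.mode(commute_times))    # most frequently occurring value
```

Notice that the single large value (55) pulls the mean above the median, which is one reason the median is often reported alongside the mean.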

Measures of center can give us some good introductory information about our data, but we can further explore the data through measures of spread (or dispersion).  In other words, how dispersed are the data points – how far away from each other are they?  Variance and standard deviation are the two most commonly used measures of spread, and they are interrelated.

Variance is calculated by 1) squaring the distance between each observation and the mean, and then 2) taking the average of all squared distances.  Because the difference between each observation and the mean is squared, the formula for calculating variance puts particular weight on observations that are far from the mean, or outliers.

The standard deviation for a set of observations is the square root of the variance.  It’s essentially a more interpretable version of the variance: taking the square root brings the measure back to the same units as the original data, rather than squared units.
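A minimal sketch of these two calculations, reusing the same made-up commute times, makes the squaring and square-root steps visible:

```python
# Variance and standard deviation computed step by step (population form, dividing by n).
commute_times = [22, 25, 25, 30, 31, 34, 55]

mean = sum(commute_times) / len(commute_times)
squared_distances = [(x - mean) ** 2 for x in commute_times]

variance = sum(squared_distances) / len(commute_times)  # average squared distance from the mean
std_dev = variance ** 0.5                               # square root brings us back to minutes

print(round(variance, 2), round(std_dev, 2))
```

(In practice, the sample variance is often computed by dividing by n - 1 rather than n; the idea is the same.)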

Between measures of center and spread, we’ve pretty much covered descriptive statistics.  We can now move on to inferential statistics.  The first topic to cover in inferential statistics is regression, which will expand upon the measures of spread just discussed in descriptive statistics.

Regression is a measure of association.  It’s an attempt to describe an entire set of data with one single line: the line that “best fits” the linear relationship between a set of variables.  In order to understand regression, we have to first understand correlation.

Correlation measures the degree to which two phenomena are related to one another.  In order to fully understand correlation though, we have to understand why we might observe a correlation between phenomena.  There are three reasons why this could happen:

  1. There is a logical relationship between the two (or more) variables (for example, interest rates and mortgage rates)
  2. There is another external factor affecting both variables (for example, bad weather during construction delays both the foundation work and the framing, so the two timelines move together without one causing the other)
  3. The observed correlation occurred purely by chance and no correlation actually exists

Of these three, we care about the first reason – it demonstrates relatedness.  And when we observe this relatedness (or correlation) between phenomena, we are able to make educated inferences.  We do have to be careful to understand the difference between Correlation and Causation, though.  Establishing correlation between two variables does not establish causation.  These are two completely different concepts.  We cannot know for certain that a relationship is causal (a change in one variable is causing a change in the other variable) simply because it is highly correlated.
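As a quick illustration of measuring correlation, here is a sketch assuming NumPy is available; the two series are invented stand-ins for related variables like interest rates and mortgage rates.

```python
# Pearson correlation between two made-up, related series.
import numpy as np

interest_rates = np.array([3.0, 3.5, 4.0, 4.5, 5.0, 5.5])
mortgage_rates = np.array([4.1, 4.6, 5.2, 5.5, 6.1, 6.4])

r = np.corrcoef(interest_rates, mortgage_rates)[0, 1]  # correlation coefficient, between -1 and 1
print(round(r, 3))  # close to 1 here, indicating a strong positive relationship
```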

Much like correlation, regression is an attempt to get predictive about data.  Regression is a mathematical equation that allows one variable to be predicted from another variable.  A regression analysis takes currently available data, translates it into a line that “best fits” the data, and then uses that line to predict future outcomes.  Regression typically uses a methodology called ordinary least squares, which fits a line by minimizing the sum of the squared vertical distances (the residuals) between each data point and the fitted line.  Ordinary least squares gives us the best description of a linear relationship between two variables.  If the squared errors are small, it tells us that the line is a really good fit.  This goodness of fit is measured by a term called R-Squared.
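Here is a minimal ordinary least squares sketch on the same invented data, again assuming NumPy; it fits the “best fit” line and computes R-Squared from the residuals.

```python
# Ordinary least squares fit and R-squared for a simple linear relationship.
import numpy as np

x = np.array([3.0, 3.5, 4.0, 4.5, 5.0, 5.5])   # e.g., interest rates
y = np.array([4.1, 4.6, 5.2, 5.5, 6.1, 6.4])   # e.g., mortgage rates

slope, intercept = np.polyfit(x, y, 1)          # degree-1 (straight line) fit
predicted = slope * x + intercept

ss_residual = np.sum((y - predicted) ** 2)      # squared errors around the fitted line
ss_total = np.sum((y - y.mean()) ** 2)          # squared errors around the mean of y
r_squared = 1 - ss_residual / ss_total          # proportion of variation explained by the line

print(round(slope, 2), round(intercept, 2), round(r_squared, 3))
```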

Regression is predictive because it gives us an idea of what to expect from future data.  If a new data point lands far from the “best fit” line, we expect subsequent data points to fall closer to that line rather than continuing the extreme.  This concept is often referred to as Regression to the Mean: unusually extreme outcomes tend to be followed by more typical ones.

Regression is a very important part of inferential statistics, but the Normal Distribution (often called the “bell curve”) is at the core of almost everything we do in inferential statistics.  In fact, the normal distribution is arguably the most important concept in all of statistics.

The power of the normal distribution lies in the fact that it underpins parametric tests.  All statistical tests are either parametric (i.e. they assume that the data were sampled from a particular form of distribution, such as a normal distribution) or non-parametric (i.e. they do not assume that the data were sampled from a particular type of distribution).  In general, parametric tests are more powerful than non-parametric ones and so should be used if at all possible.

The beauty of the normal distribution is that we know by definition exactly what proportion of the observations in a normal distribution lie within one standard deviation of the mean (68.2%), within two standard deviations (95.4%), and within three standard deviations (99.7%).
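This is easy to verify numerically; a short sketch, assuming SciPy is available:

```python
# Proportion of a normal distribution within 1, 2, and 3 standard deviations of the mean.
from scipy.stats import norm

for k in (1, 2, 3):
    proportion = norm.cdf(k) - norm.cdf(-k)
    print(k, round(proportion * 100, 1))  # roughly 68%, 95%, and 99.7%
```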

The wide applicability of the normal distribution is based on the central limit theorem.  The core principle underlying the central limit theorem is that a large, properly drawn sample will resemble the population from which it is drawn.  More precisely, the theorem tells us that the means of large samples will themselves be distributed approximately normally around the population mean, even when the underlying population is not normal.  The easiest way to gather a representative sample of a larger population is to select some subset of that population randomly.
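A small simulation shows the idea, assuming NumPy; even though the underlying (made-up) population is skewed, the means of repeated samples cluster tightly around the population mean.

```python
# Central limit theorem in miniature: sample means from a skewed population.
import numpy as np

rng = np.random.default_rng(0)
population = rng.exponential(scale=10.0, size=100_000)      # decidedly non-normal population

sample_means = [rng.choice(population, size=50).mean() for _ in range(2_000)]

print(round(population.mean(), 2))      # population mean (about 10)
print(round(np.mean(sample_means), 2))  # mean of the sample means lands in the same place
print(round(np.std(sample_means), 2))   # and the sample means are far less spread out
```

Plotting the sample means as a histogram would show the familiar bell shape, even though the population itself is skewed.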

It turns out that the normal distribution describes many common phenomena.  However, when working with a small sample of data, the normal distribution is not helpful.  In these situations, the t-distribution makes more sense.  Another way to think about the applicability of the normal distribution vs. the t-distribution is through Degrees of Freedom.  Degrees of Freedom are roughly equal to the number of observations in a sample (for many common tests, the sample size minus one); the more degrees of freedom we have, the more confident we can be that our sample represents the true population, and the “tighter” our distribution will be.  When the number of degrees of freedom gets large, the t-distribution converges to the normal distribution.
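This convergence can be seen directly; a sketch assuming SciPy is available:

```python
# The t-distribution approaches the normal as degrees of freedom grow.
from scipy.stats import norm, t

for df in (2, 10, 100):
    within_2 = t.cdf(2, df) - t.cdf(-2, df)        # proportion within 2 standard units
    print(df, round(within_2 * 100, 1))            # fatter tails at low df, ~95% at high df

print("normal", round((norm.cdf(2) - norm.cdf(-2)) * 100, 1))  # about 95.4%
```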

While the normal distribution is the core of almost everything we do in inferential statistics, any statistical inference must begin with a hypothesis.  In statistics, the hypothesis that we begin with is called the null hypothesis.  This is our starting assumption (think of it as the status quo).  If we reject the null hypothesis, then we typically accept some alternative hypothesis that is more consistent with the data observed.

When we perform a hypothesis test, a p-value (probability value) helps us determine the significance of our results.  We compare this p-value to a significance level that we choose in advance; the most common choice is 5%, at which results are considered “statistically significant.”  A p-value less than this threshold indicates strong evidence against the null hypothesis, which means we reject it in favor of the alternative.  In statistics-speak, a significance level of 5% means we can “reject a null hypothesis at the .05 level if there is less than a 5% chance of getting an outcome at least as extreme as what we’ve observed if the null hypothesis were true.”

If a significance level of 5% seems somewhat arbitrary, that’s because it is.  There is no single standardized statistical threshold for rejecting a null hypothesis.  Significance levels of both 1% (deemed “statistically highly significant”) and 10% are also reasonably common thresholds for doing this kind of analysis.
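Putting the pieces together, here is a sketch of a two-sample hypothesis test, assuming SciPy; the group measurements are invented for illustration.

```python
# A two-sample t-test: the null hypothesis is that both groups share the same mean.
from scipy.stats import ttest_ind

group_a = [12.1, 11.8, 12.4, 12.0, 12.3, 11.9, 12.2]
group_b = [12.9, 13.1, 12.7, 13.0, 12.8, 13.2, 12.6]

result = ttest_ind(group_a, group_b)
print(round(result.pvalue, 4))  # probability of a difference at least this extreme under the null

alpha = 0.05                    # significance level, chosen before looking at the data
if result.pvalue < alpha:
    print("Reject the null hypothesis at the .05 level")
else:
    print("Fail to reject the null hypothesis")
```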

The discussion on p-values demonstrates that we are working with probabilities, not certainties.  And in our efforts to test our hypotheses, it is important to account for the possibility of error.  In hypothesis testing, we care about two errors: Type I and Type II.

A Type I Error is wrongly rejecting a null hypothesis (a false positive, or wrongly classifying legitimate email as spam).  In other words, deciding the null hypothesis is false, when in fact it is true; declaring a difference, when in fact no difference exists.  A Type II Error is failing to reject when in fact the null hypothesis is false (a false negative, or letting spam into the inbox).
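These error rates can be made tangible with a simulation, assuming NumPy and SciPy; when the null hypothesis is actually true, testing at the 5% level produces a false positive about 5% of the time.

```python
# Estimating the Type I error rate when the null hypothesis is true.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
trials = 2_000
false_positives = 0

for _ in range(trials):
    a = rng.normal(loc=0, scale=1, size=30)
    b = rng.normal(loc=0, scale=1, size=30)    # drawn from the same distribution: no real difference
    if ttest_ind(a, b).pvalue < 0.05:
        false_positives += 1                   # rejecting a true null is a Type I error

print(false_positives / trials)  # hovers around 0.05, by construction
```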

All these concepts and processes can seem mind-bending at first, but as we work through them (slowly), they start to make sense.  After enough practice, it really does become routine.  Of course, at that point, we can move on to non-normal distributions (which should be saved for another day).  I will say that Charles Wheelan does about as good a job as I’ve seen in describing statistics in a simple-to-understand manner in his book Naked Statistics.  It’s worth a read in order to become more conversant in Statistics.

Now, a discussion on statistics would be remiss without mention of the numerous Cognitive Biases that plague our ability to perform good statistics.  Most notable in this group are the following:

Given all we’ve just discussed, statistics can seem unwieldy at times.  But statistics is really just the science of likelihood, and we use it every day to reduce uncertainty.  It gets a bit more complicated at advanced levels, but the realities remain the same.  When used properly, statistics helps us gain new insight or knowledge.