An Accelerated Master’s Degree: Data Mining in a Book (or Two)


Photo by Stuck in Customs

There have been plenty of books recently published on the concept of Deliberate Practice, which essentially says that it takes 10,000 hours of a certain kind of practice (called ‘deliberate’) to gain expertise in something.   It makes sense that the majority of what we want to learn in any discipline is going to be experiential (or gained through practice).  But in order to better understand our experiences, we want to have some kind of framework of what to expect.  We want to develop a theory structure.

Books are what give us this theory structure, and certainly the quality of the theory structure we begin with impacts the amount of deliberate practice we need to become an ‘expert.’  So it’s important to choose the right books, as they will provide the base infrastructure upon which we will layer our experiences.  We’re looking for books that concisely capture the overriding concepts of a particular discipline.

And in any fairly well-defined discipline, there don't need to be that many books to accomplish this.  I would generally say that 3 books or fewer, for each discipline, will give you a proper theory structure.

With that in mind, let’s look at Data Mining… 

Actually – before we dig into the details of data mining, let’s review the concept of data science, from which data mining is born.  At a high level, data science is a set of fundamental principles that guide the extraction of knowledge from data.  This process can be treated systematically by following a process with reasonably well-defined stages.  And the one book that best captures both the principles and the process of data science is Data Science for Business.

A critical skill in data science is the recursive ability to decompose a data analytics problem into pieces such that each piece matches a known task for which tools are available.  In case this sounds confusing, I'll explain further.  It simply means that, within data science, we have a certain number of tasks available to us.  Each of these tasks can assist in solving certain aspects of problems.  So when we have a large data analytics problem, we simply break it down into pieces small enough to be solvable by the tasks we currently know.  Examples of these tasks would be classification, regression, and clustering.  This is as true in data science as it is in data mining.

Now what usually separates data mining from other types of data analytics is the quantity of data.  So data mining involves the same knowledge extraction process as data science, except that the information is extracted from large data sets or databases.

Data Mining is a fairly new discipline that focuses on the automated search for knowledge, patterns, or regularities from data.  In other words, data mining seeks to find informative attributes from data.  This could be with an aim to predict the future or to detect unsuspected features in the data.  And the one book that best covers the core concepts of data mining is a text called (oddly enough) Principles of Data Mining.  It is, quite literally, the Gospel of Data Mining.

In practicing data mining, the Scientific Method is used to generate a hypothesis about the data, and then test the hypothesis.  This is fairly straightforward.  There are, however, numerous data mining techniques available to test the hypothesis.  And Statistics lies at the heart of them all.  This is because the most fundamental method for analyzing very large datasets is to first select a subset of the data to analyze.  It's more efficient that way.  And this leads us to make inferences about large data sets from smaller subsets.  And that, my friend, is Statistics.  Statistics allows us to make statements about population structures, to estimate the size of these structures, and to state our degree of confidence in them, all on the basis of a sample (or subset).  And because data mining involves such large data sets, even small effects will often register as statistically significant, so significance alone should be interpreted with care.
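To make the sampling idea concrete, here is a minimal sketch in Python: we treat a list of generated numbers as a stand-in for a "large" database column, draw a much smaller random sample, and infer the population mean with a rough 95% confidence interval.  The data, sizes, and seed are all invented for illustration.

```python
import random
import statistics

random.seed(42)

# Pretend this list is a "large" database column (100,000 values,
# drawn around a true mean of 100).
population = [random.gauss(100, 15) for _ in range(100_000)]

# Draw a much smaller sample and make inferences from it alone.
sample = random.sample(population, 1_000)
mean = statistics.mean(sample)
stderr = statistics.stdev(sample) / (len(sample) ** 0.5)

# Rough 95% confidence interval for the population mean.
low, high = mean - 1.96 * stderr, mean + 1.96 * stderr
print(f"estimate: {mean:.1f}, 95% CI: ({low:.1f}, {high:.1f})")
```

The point is exactly the one above: we never touch most of the 100,000 values, yet we can state an estimate and a degree of confidence from the sample alone.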

The results of our statistics-driven data mining exercises are either models or patterns. So the input is data; the output is a model or pattern.  This is how we express the relationships or summaries that we derive from data.  Generally speaking, a model is a simplified representation of reality created to serve a purpose.

Our goal is to draw a sample from the database that allows us to construct a model that reflects the structure of the data in the database.  The model may be descriptive: summarizing the data in a convenient and concise way.  Or the model may be inferential: allowing us to make a statement about the population from which the data were drawn or about likely future data values.  Either way, modeling, like data analysis in general, is an iterative process.  We fit a model to the sample data, and then modify or extend it in light of the results.  Rinse and repeat.

When we speak of models here, the most common starting point is the linear model.  These are the models derived from linear equations (the stuff we learned in Algebra).  And these models underlie linear regression, which is widely used in data mining.  Regression involves minimizing a sum of squares, or the least squares method.  This is done to ultimately "fit" a line to various data points, and the concept is ubiquitous in data mining.  Of course, we don't expect linear models to perfectly explain the data.  Even in the best of circumstances there will be small nonlinear effects that we will be unable to capture in a model.
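Least squares is simple enough to sketch in a few lines of plain Python, using the closed-form solution for the slope and intercept of the best-fit line.  The data points are invented for illustration (they lie roughly along y = 2x).

```python
# Fit a line y = slope*x + intercept by least squares.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.0, 9.9]  # roughly y = 2x, with noise

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# slope = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2)
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
        / sum((x - mean_x) ** 2 for x in xs)
intercept = mean_y - slope * mean_x

print(f"fitted line: y = {slope:.2f}x + {intercept:.2f}")
# -> fitted line: y = 1.97x + 0.11
```

The fitted line doesn't pass through every point, and that's the point made above: the residual noise is what a linear model cannot (and shouldn't try to) capture.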

So Statistics lies at the heart of the techniques (or methods) that are used in data mining, and the ultimate result of our data mining is either a model or a pattern.  That still doesn’t explain the methods themselves.  There are quite a few methods that analysts rely on, but some of the more common ones are:

  • Decision Trees/Rules: These are (not surprisingly) tree-shaped structures that represent sets of decisions.  One of the most widely used decision trees is called Classification & Regression Trees (CART), which aims to find a tree close to the optimal tree size.  CART tries to find a model that is complex enough to capture any structure that exists, but not so complex that it “overfits.”  It uses what is called a “greedy” local search method to identify good candidate tree structures, recursively expanding the tree from a root node (or branch), and then gradually “pruning” back specific branches of the tree.
  • Regression: As mentioned above, this is one of the primary methods of better understanding data.
  • Clustering: The aim of clustering is to divide data into naturally occurring regions in which the points are closely or densely clustered.  It is an automated process to group related records together, and then examine the attributes and values that define the clusters or segments.
  • Classification:  This is similar to clustering in that it also segments data into distinct groups (in this case, called classes).  But unlike clustering, classification requires that the analyst know ahead of time how the classes are defined.
  • Visualization: Visualization is important in data mining because it is ideal for sifting through data to find unexpected relationships.  On-line Analytical Processing (OLAP) systems are designed to facilitate this visual or manual exploration of the data.  OLAP essentially provides an easy-to-use graphical user interface (GUI) to query large data collections.  The idea of "on-line" processing is that it is done in real-time, so that analysts can find answers quickly and efficiently.  OLAP, however, does not allow ad hoc querying (such as would be done in SQL).  For OLAP, the dimensions of analysis must be pre-programmed into the OLAP system.
  • Neural Networks: These are non-linear predictive models that learn through training and resemble biological neural networks in structure.  This is all the rage right now!
  • Naïve Bayes: This is derived from Bayesian Statistics, and has the ability to tease out more subtle information from data.
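To ground one of the methods above, here is a toy sketch of clustering using a bare-bones k-means on one-dimensional points: repeatedly assign each point to its nearest center, then move each center to the mean of its assigned points.  The data, initial centers, and iteration count are invented for illustration; real data mining tools do this at far larger scale and in many dimensions.

```python
# Two natural groups: values near 1 and values near 8.
points = [1.0, 1.2, 0.8, 8.0, 8.3, 7.9]
centers = [0.0, 10.0]  # two deliberately bad initial centers

for _ in range(10):  # iterate: assign points, then move centers
    clusters = [[], []]
    for p in points:
        # assign each point to its nearest current center
        nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
        clusters[nearest].append(p)
    # recompute each center as the mean of its assigned points
    # (keep the old center if a cluster ends up empty)
    centers = [sum(c) / len(c) if c else centers[i]
               for i, c in enumerate(clusters)]

print(centers)  # the centers settle near the two natural groups
```

This is the "automated process to group related records together" described above: no one told the algorithm where the groups were, yet the centers converge to roughly 1.0 and 8.1.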

And these are just some of the methods available for better understanding data!  The list goes on.  Unfortunately, there is no set method that should be uniformly used to understand data better.  The method will always depend on the unique nature and features of the problem to be solved, the availability and quality of data, and the ultimate objectives.  But the objective of Data Mining is quite clear: extract knowledge from data.