Books are what give us this theory structure, and certainly the quality of the theory structure we begin with impacts the amount of deliberate practice we need to become an ‘expert.’ So it’s important to choose the right books, as they will provide the base infrastructure upon which we will layer our experiences. We’re looking for books that concisely capture the overriding concepts of a particular discipline.

And in any discipline, at least any fairly well-defined one, there don't need to be many books to accomplish this. Generally speaking, three books or fewer per discipline will give you a proper theory structure.

With that in mind, let’s look at Data Mining…

Actually – before we dig into the details of data mining, let’s review the concept of data science, from which data mining is born. At a high level, data science is a set of fundamental principles that guide the extraction of knowledge from data. This process can be treated systematically by following a process with reasonably well-defined stages. And the one book that best captures both the principles and the process of data science is **Data Science for Business**.

A critical skill in data science is the **Recursive** ability to decompose a data analytics problem into pieces such that each piece matches a known task for which tools are available. In case this sounds confusing, I’ll explain further. It simply means that, within data science, we have a certain number of tasks available to us. Each of these tasks can assist in solving certain aspects of problems. So when we have a large data analytics problem, we simply break it down into pieces small enough to be solvable by the tasks we currently know. Examples of these tasks would be classification, regression, and **Clustering**. This is as true in data science as it is in data mining.

Now what usually separates data mining from other types of data analytics is the quantity of data. Data mining involves the same knowledge extraction process as data science, except that the knowledge is extracted from large data sets or databases.

Data Mining is a fairly new discipline that focuses on the automated search for knowledge, patterns, or regularities from data. In other words, data mining seeks to find informative attributes from data. This could be with an aim to predict the future or to detect unsuspected features in the data. And the one book that best covers the core concepts of data mining is a text called (oddly enough) **Principles of Data Mining**. It is, quite literally, the Gospel of Data Mining.

In practicing data mining, the **Scientific Method** is used to generate a hypothesis about the data, and then test the hypothesis. This is fairly straightforward. There are, however, numerous data mining techniques available to test the hypothesis. And Statistics lies at the heart of them all. This is because the most fundamental method for analyzing very large datasets is to first select a subset of the data to analyze. It’s more efficient that way. And this leads us to make inferences about large data sets from smaller subsets. And that, my friend, is Statistics. Statistics allows us to make statements about population structures, to estimate the size of these structures, and to state our degree of confidence in them, all on the basis of a sample (or subset). And because data mining involves large data sets, we should expect even small effects to reach statistical significance.

The results of our statistics-driven data mining exercises are either *models* or *patterns*. So the input is data; the output is a model or pattern. This is how we express the relationships or summaries that we derive from data. Generally speaking, a model is a simplified representation of reality created to serve a purpose.

Our goal is to draw a sample from the database that allows us to construct a model that reflects the structure of the data in the database. The model may be descriptive: summarizing the data in a convenient and concise way. Or the model may be inferential: allowing us to make a statement about the population from which the data were drawn or about likely future data values. Either way, modeling, like data analysis in general, is an iterative process. We fit a model to the sample data, and then modify or extend it in light of the results. Rinse and repeat.

When we speak of models, we are specifically talking about *linear* models. These are the models driven from linear equations (the stuff we learned in **Algebra**). And these models underlie *linear* regression, which is widely used in data mining. Regression involves minimizing a sum of squares, or the least squares method. This is done to ultimately “fit” a line to various data points, and the concept is ubiquitous in data mining. Of course, we don’t expect linear models to perfectly explain the data. Even in the best of circumstances there will be small nonlinear effects that we will be unable to capture in a model.
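To make the least squares idea concrete, here is a minimal sketch in Python (the data points and the `fit_line` helper are made up purely for illustration) that fits a line using the closed-form least squares solution:

```python
# Fit a line y = a + b*x by least squares; the data points below are made up.

def fit_line(xs, ys):
    """Return intercept a and slope b minimizing the sum of squared errors."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Closed-form least-squares slope:
    # b = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)**2)
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    a = mean_y - b * mean_x
    return a, b

xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.0, 8.1, 9.9]  # roughly linear, with small deviations

a, b = fit_line(xs, ys)
print(a, b)  # intercept ~0.06, slope ~1.98
```

Note that the fitted line doesn't pass through every point; the small residuals are exactly the "small nonlinear effects" the model can't capture.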

So Statistics lies at the heart of the techniques (or methods) that are used in data mining, and the ultimate result of our data mining is either a model or a pattern. That still doesn’t explain the methods themselves. There are quite a few methods that analysts rely on, but some of the more common ones are:

- Decision Trees/Rules: These are (not surprisingly) tree-shaped structures that represent sets of decisions. One of the most widely used decision trees is called Classification & Regression Trees (CART), which aims to find a tree close to the optimal tree size. CART tries to find a model that is complex enough to capture any structure that exists, but not so complex that it “overfits.” It uses what is called a “greedy” local search method to identify good candidate tree structures, recursively expanding the tree from a root node (or branch), and then gradually “pruning” back specific branches of the tree.
- Regression: As mentioned above, this is one of the primary methods of better understanding data.
- Clustering: The aim of clustering is to divide data into naturally occurring regions in which the points are closely or densely clustered. It is an automated process to group related records together, and then examine the attributes and values that define the clusters or segments.
- Classification: This is similar to clustering in that it also segments data into distinct segments (in this case, called classes). But unlike clustering, classification requires that the analyst know ahead of time how the classes are defined.
- Visualization: Visualization is important in data mining because it is ideal for sifting through data to find unexpected relationships. On-line Analytical Processing (OLAP) systems are designed to facilitate this visual or manual exploration of the data. OLAP essentially provides an easy-to-use graphical user interface (GUI) for querying large data collections. The idea of “on-line” processing is that it is done in real time, so that analysts can find answers quickly and efficiently. OLAP, however, does not allow ad hoc querying (such as would be done in SQL); the dimensions of analysis must be pre-programmed into the OLAP system.
- Neural Networks: These are non-linear predictive models that learn through training and resemble biological neural networks in structure. This is all the rage right now!
- Naïve Bayes: This is derived from Bayesian Statistics, and has the ability to tease out more subtle information from data.
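To give a flavor of how one of these methods works in code, here is a toy Naive Bayes classifier in Python. The training "emails," word sets, and smoothing scheme are invented for illustration; this is a sketch of the idea, not a production implementation:

```python
from collections import Counter, defaultdict

# Hypothetical training set: (set of words in message, label)
train = [
    ({"win", "money", "now"}, "spam"),
    ({"win", "prize"}, "spam"),
    ({"meeting", "tomorrow"}, "ham"),
    ({"project", "meeting", "notes"}, "ham"),
]

label_counts = Counter(label for _, label in train)
word_counts = defaultdict(Counter)  # word_counts[label][word] = doc frequency
for words, label in train:
    word_counts[label].update(words)

vocab = {w for words, _ in train for w in words}

def score(words, label):
    """Unnormalized probability of `label` given `words`, with add-one smoothing."""
    p = label_counts[label] / sum(label_counts.values())
    for w in words:
        p *= (word_counts[label][w] + 1) / (label_counts[label] + len(vocab))
    return p

def classify(words):
    return max(label_counts, key=lambda label: score(words, label))

print(classify({"win", "money"}))      # "spam"
print(classify({"meeting", "notes"}))  # "ham"
```

The "naive" part is the assumption that words occur independently given the label, which is what lets us simply multiply the per-word probabilities.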

And these are just *some* of the methods available for better understanding data! The list goes on. Unfortunately, there is no set method that should be uniformly used to understand data better. The method will always depend on the unique nature and features of the problem to be solved, the availability and quality of data, and the ultimate objectives. But the objective of Data Mining is quite clear: extract knowledge from data.

But many commonly used metrics don’t provide any actionable insight. In other words, they’re just for show. These are called vanity metrics. Other times metrics don’t properly measure the underlying data, potentially resulting in what only *appears* to be a valid metric on the surface. This is called an **Isomorphism**.

A metric is only as valuable as its ability to decipher underlying data. When metrics *are* properly developed and implemented, they become meaningful because they capture the drivers that lead to the behaviors and decisions desired.

A great resource for understanding metrics is the book Lean Analytics. Although geared to start-ups, the logic used is widely applicable to organizations large and small. You will find much of this logic in the following paragraphs.

In an effort to limit confusion and concentrate focus, our search for meaningful metrics should be aimed towards finding the one metric that matters the most. If we **Optimize** the organization to **Maximize** this one metric, it will reveal the next place for us to focus our efforts. And we continue this process over and over again, improving this one metric (through experimentation) until it is good enough for us to move on to the “new” most important metric.

A good metric is usually comparative, understandable, and takes the form of a ratio or rate. It can be either a lagging (descriptive) or leading (predictive) indicator. At the outset, we’re stuck with lagging indicators since we don’t have much data to work with. However, once we do have enough data, we want a mix of both leading and lagging indicators. And if possible (once we have the data), we want our metrics to be more leading indicator-focused than lagging indicator-focused. Becoming too past-focused (through lagging indicators) can cause an organization to stagnate. That hasn’t stopped the typical organizational scorecard from including 80-90% past-focused metrics, though.

Ideally, a scorecard or collection of performance metrics for an organization should consist of about 75% leading and 25% lagging metrics. The best leading indicators commence at the beginning of the customer lifecycle. From the very first “touch point” (in marketing-speak), we should be collecting data. This data will ultimately help us craft the proper leading indicators.

Collecting data doesn’t just lead to better metrics (and hence, better decisions). It also improves organizational efficiency. Good metrics can create a flatter, more autonomous organization once everyone buys in to a data-informed approach. There is no longer the need to propagate decisions across an organization. We can empower employees to make more decisions themselves, provided they have the data in place to support them. We can create a culture of responsibility.

Information is data that is aggregated to a level where it makes sense for decision support (usually in the shape of reports, tables, or lists). Knowledge is information that has been analyzed and interpreted. With the growth of “big data,” the race is on to convert this data to beneficial knowledge. Almost all aspects of life are being “datafied,” or turned into data.

In our efforts to extract knowledge from data, we have to understand the value of information. There is only so much information that can provide clear benefit. As an extension of **Zipf’s Law**, each successive attempt to dig deeper into a data set will yield exponentially weaker meaning. At some point, there is a **Cost/Benefit** relationship that needs to be considered when digging deeply into data. For this reason, we value information – we want to know how important each piece of information will be. And there are only three basic reasons why information ever has value to a business:

- Information reduces uncertainty about decisions that have economic consequences
- Information affects the behavior of others, which has economic consequences
- Information sometimes has its own market value

We will focus on the first two reasons in this post, as they directly influence decision-making inside an organization. Given the *why*, we can now look at the *how.* For this explanation, I am relying heavily on the (pretty much) amazing book How to Measure Anything by Douglas Hubbard.

If we’re trying to value information for the effect it has on the behavior of others, the value is exactly equal to the value of the difference in human behavior. This is actually fairly logical.

If we’re trying to value information for its ability to reduce uncertainty in decision-making, there is a bit more to it. We first need to understand what is called the “Expected Opportunity Loss” or “EOL” for a particular strategy. The EOL is the chance of being wrong multiplied by the cost of being wrong (for each scenario in a given strategy). For example, let’s say there is a project that is being reviewed for approval. It can either be approved or rejected. We place a cost on each possibility – approval being the actual cost of the project and rejection being the foregone gain of the project. We then place a likelihood of each option happening. With this information, we can calculate the EOL for each scenario in the project.

Let’s say that (in the example above) the EOL for approval is $2 million and the EOL for rejection is $24 million. We can now calculate what’s called the “Expected Value of Perfect Information” or “EVPI.” This is simply the EOL before any information is introduced. In this example, we have yet to introduce information, so the EVPI for project approval is $2MM and the EVPI for project rejection is $24MM. Another way to think about EVPI is the gain received from eliminating uncertainty.

If we can only *reduce but not eliminate* uncertainty, we still want a way to value that reduction. The value of this reduction (the difference between the EOL before a measurement and the EOL after a measurement) is called the “Expected Value of Information” or “EVI.” Its purpose is to value the reduction in risk, which *is* the *value of information*.
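Putting the last few paragraphs together, here is a small Python sketch of EOL, EVPI, and EVI. The probabilities and costs are hypothetical, chosen only so the resulting EOL figures match the $2MM and $24MM used above:

```python
# Worked version of the EOL / EVPI / EVI discussion.
# All probabilities and dollar amounts below are hypothetical.

def eol(p_wrong, cost_wrong):
    """Expected Opportunity Loss: chance of being wrong times cost of being wrong."""
    return p_wrong * cost_wrong

# Scenario 1: approve the project (cost of being wrong = wasted project cost)
eol_approve = eol(p_wrong=0.10, cost_wrong=20_000_000)   # $2MM

# Scenario 2: reject the project (cost of being wrong = foregone gain)
eol_reject = eol(p_wrong=0.80, cost_wrong=30_000_000)    # $24MM

# EVPI: the EOL before any new information is introduced,
# i.e. the gain received from eliminating uncertainty entirely.
evpi_approve = eol_approve
evpi_reject = eol_reject

# EVI: value of a measurement that merely *reduces* uncertainty.
# Suppose a study lowers the chance of wrongly rejecting from 0.80 to 0.30:
eol_reject_after = eol(p_wrong=0.30, cost_wrong=30_000_000)  # $9MM
evi = eol_reject - eol_reject_after                          # $15MM

print(evpi_reject, evi)
```

The $15MM EVI here is the most a rational decision-maker should pay for that (imperfect) study.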

For the vast majority of variables that we can examine, the current level of uncertainty is acceptable. In other words, the vast majority of variables have an information value of zero. But for those that do have information value, we devote measurement attention through concepts like EOL, EVPI, and EVI. Oftentimes, the economic value of measuring a variable is **Inversely** proportional to how much measurement attention it gets. There can be great value in attending to previously unattended information.

As more and more data is compiled, there will be more opportunities for better measurement of information. We often start the measurement process by reviewing the available historical information, and this is fine. Even though we have no logical reason for believing the future will resemble the past, using historical data to measure (or infer) is probably an improvement on unaided human judgment. And after all, we have to start somewhere.

In order to make these decisions, we have to understand the ultimate value that various combinations of this data can present. So, we measure it. That is, we measure what data carries: information. Measurement is what informs uncertain decisions, and almost all decisions are made under uncertainty.

Measurement is a very particular process, and prior to carrying it out in analytics, we need to address certain questions:

- What decision will this measurement support?
- What observable consequences are we measuring?
- How will measuring these observable consequences matter for the decision we’re supporting?
- How much do we currently know (what’s our current level of uncertainty)?
- What is the value of additional information?

When we speak of measurement, we mean a quantitatively expressed reduction of uncertainty based on one or more observations. So, if we have no idea how long a bridge is, we have complete uncertainty. But if we then say that we are 90% confident that the bridge is between 10 and 10,000 feet long, we have reduced uncertainty about the length of the bridge. *That’s* a measurement. This is to say that, no matter how difficult we perceive a measurement problem to be, there is usually some level of uncertainty that can be reduced. We can always look to other fields when stumped – it’s likely our measurement problem occurs elsewhere, and others have made progress in resolving it. And even if we have no idea what to measure, it still makes sense to measure something. *The process itself will teach us what to measure*.

In business, observational data tends to be what is measured. And it is the first few observations that usually provide the highest payback in uncertainty reduction for a given amount of effort (**Diminishing Returns** set in quickly).
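Hubbard makes this quick payback concrete with his "Rule of Five": there is a 93.75% chance that the true median of a population lies between the smallest and largest values in a random sample of just five observations. The arithmetic is simple:

```python
# Hubbard's "Rule of Five": each random draw has a 50% chance of landing
# above the population median, so the only ways the median can fall outside
# the sample's min-max range are all-five-above or all-five-below.
p_rule_of_five = 1 - 2 * 0.5 ** 5
print(p_rule_of_five)  # 0.9375
```

Five observations are often enough to shrink uncertainty dramatically; the hundredth observation buys far less.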

The most popular way for business analysts to measure something is through spreadsheets, mainly because they are easy to use. Being easy to use means spreadsheets have been quickly and widely adopted. It also means we tend to limit ourselves to solutions that are easy for a spreadsheet. This is a form of **Cognitive Bias** called the **Availability-Misweighing Tendency**. And familiarity with spreadsheets leads users to actually *over-use* them, or use them in places where they don’t belong. This leads to errors or incorrect conclusions. Even when used properly, spreadsheets are prone to human error; more than 20% of spreadsheets have errors, and as many as 5% of all calculated cells are incorrect.

Despite these drawbacks, spreadsheets play a critical role in the analytics process. Most managers and analysts use them to segue from raw data to final report. In fact, the ever-present nature of spreadsheets has helped to craft a new generation of thought. One rooted deeply in measurement and analysis.

It’s not that business analytics is a new phenomenon – it’s been around for the past 20 years. But it’s only recently making its breakthrough from a business perspective. Why? Well, this is how Malcolm Gladwell would likely explain it.

In Outliers, Gladwell did an excellent job of explaining how Bill Gates was perfectly positioned to pioneer the computer software industry. One of the largest contributing factors was his age. This is the timeline Gladwell puts forth:

1955: Gates born

1964: First version of BASIC released

1965: First minicomputer (PDP-8) released

1967-1970: Ideal time for a future computer software maker to be in 7th grade

We can use the same “Gladwell math” to back into the ideal age for an analytics professional.

1979: VisiCalc released

1982: Lotus 1-2-3 released

1985: Excel released

1989-1991: Ideal time for a future analytics professional to be in 7th grade

This would put the ideal analytics professional in his or her mid-thirties today, usually about the time that people reach some level of authoritative decision-making status. It’s reasonable for these professionals to remain in decision-making positions for the next generation, meaning we are likely just now entering a prolonged period of applied analytics in business.



With that same approach in mind (a few well-chosen books per discipline), let’s look at Statistics…

Statistics is the study of how to collect, organize, analyze, and interpret numerical information from data. The most basic task when working with data is to summarize a great deal of information. And in our attempts to summarize information, we infer things. In other words, statistics is not out to prove anything with certainty. It is, however, out to increase our understanding of something, and it usually does that through **Statistical Inference**. We gain statistical inference from observing some pattern of outcome and then using **Probability** (the study of events and outcomes involving an element of uncertainty) to determine the most likely explanation for that outcome.

Because statistical analysis rarely unveils “the truth,” we are usually building a circumstantial case (based on imperfect data). We have to be okay with this messiness. After all, we live in a world of approximations and **Incompleteness**, as noted by Gödel. So although the field of statistics is rooted in mathematics, and mathematics is exact, the use of statistics to describe phenomena is not exact.

Statistics as a subject is broken down into two branches: Descriptive Statistics and Inferential Statistics. It’s actually pretty straightforward. Descriptive statistics involves organizing, picturing and summarizing data. Inferential statistics involves using information from a sample to draw conclusions about the population. Both branches heavily involve the use of variables to explain the data, and two very important variables to differentiate are qualitative vs. quantitative.

Qualitative aspects of statistics invoke what is referred to as **Natural Inference**. Natural inference begins with an intention to explore a particular area, collects “data” (i.e. observations and interviews), and generates ideas and hypotheses from these data largely through what is known as **Inductive** reasoning. The strength of qualitative research lies in its validity (or closeness to the truth).

Quantitative aspects of statistics are generally referred to as Statistical Inference (mentioned above). In Statistical Inference, we should begin with an idea (usually called a hypothesis) which, through measurement, generates data allowing us to make a conclusion (in other words, we should use the **Scientific Method**). The strength of the quantitative approach lies in its reliability (or repeatability).

In our attempts to describe data, we usually start with those descriptions that are easiest to understand. This means we start with what are called measures of center (or central tendency). In order for someone to get an initial understanding of the data, these are usually the first descriptions offered. They include mean, median, and mode. For a more exhaustive explanation of these measures, it’s best to turn to Sal Khan.

Measures of center can give us some good introductory information about our data, but we can further explore the data through measures of spread (or dispersion). In other words, how dispersed are the data points – how far away from each other are they? Variance and standard deviation are the two most commonly used measures of spread, and they are interrelated.

Variance is calculated by 1) squaring the distance between each observation and the mean, and then 2) taking the average of all squared distances. Because the difference between each observation and the mean is squared, the formula for calculating variance puts particular weight on observations that are far from the mean, or outliers.

The standard deviation for a set of observations is the square root of the variance. It is basically a more compact version of the variance, expressing dispersion in the same units as the original data.
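As a quick sketch, here is that two-step variance calculation (and its square root) in Python, using a handful of made-up observations:

```python
import math

data = [4, 8, 6, 2, 10]       # made-up observations
mean = sum(data) / len(data)  # 6.0

# 1) square the distance between each observation and the mean,
# 2) take the average of all squared distances
variance = sum((x - mean) ** 2 for x in data) / len(data)

# standard deviation: the square root of the variance
std_dev = math.sqrt(variance)

print(mean, variance, std_dev)  # 6.0 8.0 ~2.83
```

Note the outlier effect: the observations 2 and 10 each contribute 16 to the sum of squared distances, while 6 (sitting exactly at the mean) contributes nothing.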

Between measures of center and spread, we’ve pretty much covered descriptive statistics. We can now move on to inferential statistics. The first topic to cover in inferential statistics is regression, which will expand upon the measures of spread just discussed in descriptive statistics.

Regression is a measure of association. It’s an attempt to describe an entire set of data with one single line: the line that “best fits” the linear relationship between a set of variables. In order to understand regression, we have to first understand correlation.

Correlation measures the degree to which two phenomena are related to one another. In order to fully understand correlation though, we have to understand *why* we might observe a correlation between phenomena. There are three reasons why this could happen:

- There is a logical relationship between the two (or more) variables (for example, interest rates and mortgage rates)
- There is another external factor affecting both variables (for example, weather during construction of a building will affect how long it takes to complete construction)
- The observed correlation occurred purely by chance and no correlation actually exists

Of these three, we care about the first reason – it demonstrates relatedness. And when we observe this relatedness (or correlation) between phenomena, we are able to make educated inferences. We do have to be careful to understand the difference between **Correlation and Causation**, though. Establishing correlation between two variables does not establish causation. These are two completely different concepts. We cannot know for certain that a relationship is causal (a change in one variable is causing a change in the other variable) simply because it is highly correlated.
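A quick Python sketch of correlation, using hypothetical interest-rate and mortgage-rate figures in the spirit of the logically-related example above (the `correlation` helper and the numbers are mine, not from any real rate data):

```python
import math

def correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

interest_rates = [3.0, 3.5, 4.0, 4.5, 5.0]   # hypothetical
mortgage_rates = [4.1, 4.6, 5.2, 5.6, 6.1]   # move together with interest rates

print(round(correlation(interest_rates, mortgage_rates), 3))  # close to 1
```

A coefficient near +1 indicates the two series move together almost perfectly; near -1, they move in opposite directions; near 0, no linear relationship.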

Much like correlation, regression is an attempt to get predictive about data. Regression is a mathematical equation that allows one variable to be predicted from another variable. A regression analysis takes currently available data, translates it into a line that “best fits” the data, and then uses that line to predict future outcomes. Regression typically uses a methodology called ordinary least squares, which fits a line by minimizing the sum of the squared distances between each data point and the ideal line. Ordinary least squares gives us the best description of a linear relationship between two variables. If the squared errors are small, it tells us that the line is a really good fit. This goodness of fit is measured by a term called R-Squared.
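Here is a minimal ordinary least squares sketch in Python, including the R-Squared goodness-of-fit calculation, on made-up data (the `ols` helper is illustrative, not a library function):

```python
# Ordinary least squares with R-squared, on hypothetical data points.

def ols(xs, ys):
    """Fit y = intercept + slope*x by least squares and report R-squared."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    intercept = my - slope * mx
    # R-squared: 1 - (residual sum of squares / total sum of squares)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    r_squared = 1 - ss_res / ss_tot
    return slope, intercept, r_squared

xs = [1, 2, 3, 4]
ys = [3, 5, 7, 10]

slope, intercept, r2 = ols(xs, ys)
print(slope, intercept, r2)  # slope 2.3, intercept 0.5, R-squared ~0.989
```

An R-Squared near 1 means the line explains almost all of the variation in the data; near 0, the line explains almost none of it.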

Regression is predictive because it gives us an idea of what to expect from future data. If a new data point is introduced that is highly distinct from the “best fit” line, we expect additional data points to converge back towards the “best fit” line. This concept is often referred to as **Regression to the Mean**, which is the idea that things even out over time.

Regression is a very important part of inferential statistics, but the **Normal Distribution** (often called the “bell curve”) is at the core of almost everything we do in inferential statistics. In fact, the normal distribution is arguably the most important concept in all of statistics.

The power of the normal distribution lies in the fact that it’s a parametric test. All statistical tests are either parametric (i.e. they assume that the data were sampled from a particular form of distribution, such as a normal distribution) or non-parametric (i.e. they do not assume that the data were sampled from a particular type of distribution). In general, parametric tests are more powerful than non-parametric ones and so should be used if at all possible.

The beauty of the normal distribution is that we know by definition exactly what proportion of the observations in a normal distribution lie within one standard deviation of the mean (68.2%), within two standard deviations of the mean (95.4%), and within three standard deviations of the mean (99.7%).
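We can verify these proportions directly. The standard normal CDF is expressible with the error function, which Python's standard library provides as `math.erf` (the `within_k_sigma` helper is my own naming):

```python
import math

def within_k_sigma(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(k, round(within_k_sigma(k), 4))  # 0.6827, 0.9545, 0.9973
```

These are the familiar 68-95-99.7 proportions quoted above, to four decimal places.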

The wide applicability of the normal distribution is based on the central limit theorem. The core principle underlying the central limit theorem is that a large, properly drawn sample will resemble the population from which it is drawn. The easiest way to gather a representative sample of a larger population is to select some subset of that population randomly.

It turns out that the normal distribution describes many common phenomena. However, when working with a small sample of data, the normal distribution is not helpful. In these situations, the t-distribution makes more sense. Another way to think about the applicability of the normal distribution vs. the t-distribution is through Degrees of Freedom. Degrees of Freedom are roughly equal to the number of observations in a sample (more precisely, the number of observations minus the number of parameters being estimated); the more degrees of freedom we have, the more confident we can be that our sample represents the true population, and the “tighter” our distribution will be. When the number of degrees of freedom gets large, the t-distribution converges to the normal distribution.

While the normal distribution is the core of almost everything we do in inferential statistics, any statistical inference must begin with a hypothesis. In statistics, the hypothesis that we begin with is called the null hypothesis. This is our starting assumption (think of it as the status quo). If we reject the null hypothesis, then we typically accept some alternative hypothesis that is more consistent with the data observed.

When we perform a hypothesis test, a p-value (probability value) helps us determine the significance of our results. We use this p-value to determine whether to reject the null hypothesis or fail to reject it. We can set the significance level at various thresholds, but the most common one is 5%, at which a result is considered to be “statistically significant.” A p-value less than this amount indicates strong evidence against the null hypothesis, which means we reject it and accept the alternative. In statistics-speak, a significance level of 5% means we can “reject a null hypothesis at the .05 level if there is less than a 5% chance of getting an outcome at least as extreme as what we’ve observed if the null hypothesis were true.”
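As a sketch of that decision rule, here is a two-sided z-test in Python. The test statistic is made up, and the p-value is computed from the standard normal distribution via the error function:

```python
import math

def two_sided_p(z):
    """P(|Z| >= |z|) under a standard normal null distribution."""
    return 1 - math.erf(abs(z) / math.sqrt(2))

z = 2.4           # hypothetical test statistic
p = two_sided_p(z)

print(round(p, 4))  # ~0.0164
print(p < 0.05)     # reject the null at the .05 level -> True
```

At z = 1.96 the two-sided p-value is almost exactly 0.05, which is why 1.96 standard deviations is the familiar cutoff for 5% significance.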

If a significance level of 5% seems somewhat arbitrary, that’s because it is. There is no single standardized statistical threshold for rejecting a null hypothesis. Significance levels of both 1% (deemed “statistically highly significant”) and 10% are also reasonably common thresholds for doing this kind of analysis.

The discussion on p-values demonstrates that we are working with probabilities, not certainties. And in our efforts to test our hypotheses, it is important to account for the possibility of error. In hypothesis testing, we care about two errors: Type I and Type II.

A Type I Error is wrongly rejecting a null hypothesis (a false positive, or wrongly classifying legitimate email as spam). In other words, deciding the null hypothesis is false, when in fact it is true; declaring a difference, when in fact no difference exists. A Type II Error is failing to reject when in fact the null hypothesis is false (a false negative, or letting spam into the inbox).

All these concepts and processes can seem mind-bending at first, but as we work through them (slowly), they start to make sense. After enough practice, it really does become routine. Of course, at that point, we can move on to non-normal distributions (which should be saved for another day). I will say that Charles Wheelan does about as good a job as I’ve seen in describing statistics in a simple-to-understand manner in his book Naked Statistics. It’s worth a read in order to become more conversant in Statistics.

Now, a discussion on statistics would be remiss without mention of the numerous **Cognitive Biases** that plague our ability to perform good statistics. Most notable in this group are the following:

- Selection Bias (a form of the **Reward and Punishment Superresponse Tendency**): a bias towards selecting a non-random sample
- Publication Bias (a form of the **Reward and Punishment Superresponse Tendency**): a bias towards what is likely to be published
- Recall Bias (a form of the **Availability-Misweighing Tendency**): a tendency to better recall what is perceived to be more important
- Survivorship Bias (a form of the **Availability-Misweighing Tendency**): the logical error of concentrating on the people or things that “survived” some process and inadvertently overlooking those that did not

Given all we’ve just discussed, statistics can seem unwieldy at times. But statistics is really just the science of likelihood, and we use it every day to reduce uncertainty. It gets a bit more complicated at advanced levels, but the realities remain the same. When used properly, statistics helps us gain new insight or knowledge.

Great management comes down to three mental models: **Comparative Advantage**, **Maximizing Non-Egality**, and **Checklisting**.

**Comparative Advantage:** If an organization is employing managers, it has likely reached a point where time has become precious to a series of individuals. Time is **Inelastic** – no matter how high the demand is, the supply is limited. The limited amount of time that we each have necessitates delegation. And delegation is the end result of Comparative Advantage. Just because one person can do three things better than everyone else doesn’t mean *that person* should be doing all three things. Delegating one or two of them may yield a greater or better total result. This is the idea of Comparative Advantage.

**Maximizing Non-Egality:** Organizations are filled with a variety of different people. Each person has a unique set of strengths and weaknesses. In order to make those strengths productive (and those weaknesses irrelevant), managers Maximize Non-Egality. In other words, managers don’t treat everyone equally. This is because, logically enough, everyone *isn’t* equal. Those who outperform in certain areas are *better in that area*. And in order to yield the greatest results, it’s best to maximize inequality by only assigning the best at each task to that task. Legendary UCLA coach John Wooden was famous for this on the basketball floor. He also said that it was his commitment to preparation that forced him to look honestly at the weaknesses and faults of his team, as well as focus on the long-term goals and dreams. So in the way of Wooden, in order to maximize non-egality, preparation is an absolute requirement.

It is important to consider how certain aspects of **Cognitive Misjudgment** impact maximizing non-egality as well. The **Use-It-Or-Lose-It Tendency**, for example, reinforces maximizing non-egality. The Use-It-Or-Lose-It Tendency is the simple idea that use of a skill increases ability in that skill, and non-use decreases ability in that skill. So this tendency allows strengths to stay strong (through use), maximizing non-egality. Another aspect of cognitive misjudgment that eventually impacts maximizing non-egality is the **Senescence-Misinfluence Tendency**. This is the idea that after a certain age, our skills attenuate. And despite the obvious implications of ageism, there is a point at which age becomes a factor in ability. This must also be considered.

**Checklisting**: The overt challenge most managers face is in prioritization and execution. Because of the way we (humans) think, writing every task down in Checklist fashion allows for better prioritization and execution. The act of Checklisting often leads to the ability to group similar tasks, and **Batch** them when appropriate. Appropriate Batching can lead to **Parallel Processing** (this is not Multi-tasking – it is single tasking to stopping points). All of this leads to more work getting done.

These three mental models will account for the bulk of results in effective management. In addition however, every manager will derive some additional benefit from the following Mental Models:

**Maslow’s Hierarchy of Needs:** Managers must consider the inherent nature of human beings. Abraham Maslow’s Hierarchy of Needs sums this up quite nicely. Maslow himself elaborated on the idea: “The best managers increase the health of the workers whom they manage. They do this via the gratification of basic needs for safety, for belongingness, for affectionate relationships and friendly relationships with their informal groups, prestige needs, needs for self-respect, etc.” The fear of not having these things can play a major role in creating passionately dedicated employees.

**Feedback Loops:** Managers essentially act as a feedback loop, relaying information through an organization. Understanding how to boost certain informational relays, and mute others, is critical to a properly functioning organization.

**Systems and Constraints:** Managers are responsible for removing barriers and obstacles (or Constraints). The job of the manager is an enabling one, not a directive one – it’s coaching, not mandating.

**Symbiosis:** An organization is comprised of many different types of people. The manager is responsible for creating a symbiotic environment, where everyone lives and works together in harmony. Most people just call this culture.

There it is: a synopsis of the Mental Models that dictate great management.


With that in mind, let’s look at Ethics…

It’s best to start by stating the obvious: no amount of classroom work, exercises, or books can make someone ethical. It’s a choice. But as Ben Franklin said, honesty is the best policy. So approaching ethics from a practical standpoint can make just as much sense as approaching it from a moral standpoint.

The practical standpoint is what we all refer to as teleological (or Utilitarianism). This is a **Relative** approach to ethics (a certain behavior may or may not be wrong depending on the circumstances). The moral standpoint is what we all refer to as deontological (or Stoicism, to an extent). This is an **Absolute** approach to ethics (a certain behavior is always right or always wrong). Most of us will fall somewhere between these two extremes.

And although the single best way to teach ethics is by example, a good foundational framework can be found in Cicero’s **On Duties**. A lot of the ethical thought we use today is in part thanks to Cicero’s ability to condense and clearly write on the topic 2,000 years ago.

Cicero outlines two systems of ethical philosophy: Utilitarianism and Stoicism.

- The Utilitarian theory of morals makes virtue a means. In other words, we are to practice virtue for the good that will come of it to ourselves and others. This is often called the selfish theory of morals, as it makes the pursuit of our own happiness our duty. Any and all adaptations to this end are the sole standard of what is ‘right.’ ‘Right,’ in this sense, changes according to circumstances, and has no real attributes that are uniquely its own. This was the Epicureanism of Cicero’s time. More recently, John Stuart Mill became the foremost publicist of this philosophy, insisting that “pleasure, and freedom from pain, are the only things desirable as ends.” Mill wrote: “In the golden rule of Jesus of Nazareth, we read the complete spirit of the ethics of utility. To do as one would be done by, and to love one’s neighbor as oneself, constitute the ideal perfection of utilitarian morality.” Simply put, Utilitarianism seeks to make everyone better off through maximizing the world’s total happiness. The alternative would be unconditional compassion. And in a world where no one gets punished, bad behavior will grow.

- The Stoic theory of morals makes virtue an end. In other words, we are to practice virtue for its own sake, for the intrinsic benefit it gives us, regardless of any ulterior consequences. In this theory, ‘right’ has the same meaning regardless of circumstances, time, place, judgment, or any other external factor. ‘Right’ is indelible. What’s right is right. Period. This philosophy, from Cicero’s time till Christianity gained ascendancy, is credited with preserving Roman society from remediless corruption.

The middle ground (between Utilitarianism and Stoicism) was represented by the Peripatetics. This is where Cicero found himself. Peripatetics were a more practical group, believing that morality acts in accordance with probability: between two courses of action, pursue the one for which the more and better reasons can be given. Cicero favored this group over the more rigid system of the Stoics, but he had great sympathy for Stoic thought.

Three waves of ethical thought preceded Cicero (106 B.C. – 43 B.C.). The final one (Stoicism) weighed heavily on Cicero’s written work and beliefs.

- Taoism (Lao Tzu, ~550 B.C.): As far as we know, the principle of returning good for evil was first enunciated by Lao Tzu. In other words, “turn the other cheek.” Confucius rejects this as vain idealism.
- Confucianism (Confucius, ~500 B.C.): The essence of Confucianism is the ubiquitous phrase “do unto others as we would have them do to us.” In other words, the moral life consists in being true to oneself and good to one’s neighbor. Understanding human nature, Confucius promoted positive thinking to maintain moral balance. He believed the instincts of man are social and therefore fundamentally good. So he stressed altruism, acting socially, and living for others in living for oneself.
- Stoicism (Zeno, ~330 B.C.): Stoicism is about defining those things that we can control and those that we can’t, and only focusing on or caring about those things that we can control. Naturally, this requires a deeper understanding and management of our emotions – specifically negative emotions. The goal of the Stoics was not to banish emotion from life but to banish negative emotion. Seneca (~30 A.D.) advises us to rid ourselves of fear by limiting our desires, and Epictetus (~100 A.D.) advises us to rid ourselves of envy by taking happiness in not desiring things. An inability to rid ourselves of fear and envy can result in the **Disliking/Hating Tendency**, which distorts our ability to see things clearly and make good decisions. Ultimately, Stoicism is not so much concerned with right and wrong, but with living a good, happy life. And this means developing reason at the expense of emotion. As a byproduct, Stoics believed that what is right ought to be sought chiefly for its own sake.

It’s clear that humans have been pushing for some kind of moral code for a very long time. Why? The logical answer is that everyone’s happiness can, in principle, go up if everyone treats everyone else nicely. You refrain from cheating or mistreating me, I refrain from cheating or mistreating you; we’re both better off than we would have been in a world without morality. This is the idea of **Virtue Effects**, or the idea that good perpetuates good and vice versa. In this kind of world, mutual mistreatment would probably cancel itself out, but we’d still be left with the added cost of fear.

A good system of ethics is really about supporting psychological health, and many religions are described at some level as ideologies of exactly that. Psychological health is bolstered by friendship, affection, and trust. These traits, long before people signed contracts, were what held human societies together. Fear, avarice, and envy, on the other hand, were and still are what tear human societies apart. Ethical frameworks (and religions) are often about ridding us of these as best as is reasonably possible. Psychologically painful experiences can trigger a **Simple, Pain-Avoiding Psychological Denial Tendency**, in which the psychological pain is too great to bear. A good system of ethics can help to alleviate this pain. A poor system of ethics, on the other hand, may leave a person with the inability to deal with painful psychological experiences. This may lead to chemical dependency. In chemical dependency, morals usually break down horribly (**Drug-Misinfluence Tendency**), exacerbating the effects of an already poor ethical framework.

But after all the talk, the single best way to teach ethics is still by example. Seeing someone you respect acting properly in a stressful situation will have far more impact than any formal teachings. So Ethics is better learned **Indirectly** (through observing others) than **Directly** (through formal reading and lectures).

Without the predictive ability to determine success, it’s better to develop a system of thought. And the best system of thought is a latticework of mental models. **Induction**, or the prediction of unobserved events from knowledge of observed (similar) ones, is the result of a well-constructed latticework. And induction is the ability to go from specific concepts to a multidisciplinary understanding of the world. This ability is critical in entrepreneurship.

There are three mental models that should constantly be on the forefront of every entrepreneur’s mind. These same three mental models often explain how business value is created or destroyed.

**Scaling** (borrowed from Mathematics): Most importantly, how does scale change the behavior of every aspect of the organization? How does it change the culture, the work ethic, the cost structure, etc.? As a business scales (or grows), it passes through **Phase Changes**. A three-person business is markedly different from a 20-person business, and a 20-person business is markedly different from a 250-person business. At certain levels (of revenue, of employees, of customers), the organization will inherently change in the way it’s structured. In Chemistry-speak, the chemical makeup will change. We need to know how scaling impacts structure.

**Systems & Constraints** (borrowed from Physics): These are everywhere. As we install systems, they inherently are accompanied by constraints. As we scale, we face new constraints. And the systems to accommodate the new constraints may be lagging, further burdening an organization with more constraints. We need to know how systems may break, and how and where constraints may limit.

**Competitive Advantage** (borrowed from Economics): This can come in a variety of different flavors. But in order to avoid the Ravages of Commoditization, we need some form of competitive advantage.

Through **Synthesis**, we can combine the impact of these 3 mental models to provide a workable thought framework. The world is multidisciplinary. No one science, soft or hard, will explain everything. So we’re better off crossing boundaries and combining what we find into workable thought frameworks. **Conversion** is how the laws of different disciplines intersect to create something new. And in Entrepreneurship, it’s the conversion of concepts from Mathematics, Physics, and Economics that form the foundation of the discipline.

With that in mind, let’s look at Innovation…

The world is constantly changing, and the requirements for a sustainable organization are often changing along with it. Innovation, in the simplest sense, is acknowledging that this change exists and adapting accordingly.

Every organization engages in a constant battle between **Competitive Advantage** and **Competitive Destruction**. When competitive advantage consistently outpaces competitive destruction, we find **Sustainability**. When it’s the other way around, we find obsolescence. So being on the right side of these forces is crucial, and innovation is often what allows that to happen. Organizations are comprised of individuals, so innovation starts there.

But an innovative or creative spirit can be awfully difficult to maintain throughout life. Abraham Maslow, in Maslow on Management, tells of an interesting study conducted by Harvard researchers. The researchers set out to measure IQ, spatial, visual, social, and emotional intelligence of infants and young children. The researchers found that, up to age four, the young children scored at up to the genius level. After age four, through the development process, their scores dropped. What Maslow took from this research was that after age four, we get messages (parental, societal, etc.) which cover up our own natural tendencies toward creativity. We continuously get messages on how to solve problems – do it this way, don’t do it that way. And as a result, by the time we’re 35 or 40 our creativity is completely gone. Steve Jobs, although an anomaly himself, had similar thoughts. He thought it was rare to see an artist in his 30s or 40s able to really contribute something amazing.

This doesn’t have to be a uniform truth, though. There are ways to stomp out this creative or innovative regression, and it often hinges on our ability to overcome a variety of cognitive biases. Three cognitive forces combine to wreak havoc on either an individual’s or an organization’s ability to innovate.

**Availability-Misweighing Tendency:** Any department or organization, left to itself, will recruit people of the same training and habits as themselves. This is because we tend to place greater emphasis on what’s understandable, which would be people very much like ourselves.

**Inconsistency-Avoidance Tendency:** A group of similar people will become more and more unified in thought, unwilling to break away from long-held habits or methods. This is because we have a tendency to maintain a consistent course of behavior.

**Contrast-Misreaction Tendency:** The world changes almost imperceptibly slowly, so it is easy for someone to lull himself into the comfort that long-held habits and methods are correct. Gradual change is hard to notice, but the net result of a series of gradual changes is hard to ignore. The tendency not to notice gradual change further stunts an innovative mindset.

In order to fight these tendencies, organizations should bake another natural tendency into their culture: the **Curiosity Tendency**. Infusing curiosity into an organization happens when new people, unhampered by tradition, are continually moving into an organization. There is continuous re-examining and questioning of practices. Constructive ideas for improvement or innovation are bound to happen.

As usual, understanding a few basic Mental Models (Competitive Advantage, Competitive Destruction, and Sustainability), coupled with a slightly more in-depth understanding of **Cognitive Misjudgment,** offers a very strong foundation in Innovation. But since so many minds have weighed in on the subject, it makes sense to consider other perspectives (other than mine, that is) as well.

There are quite a few people who have made insightful contributions to the study of innovation (through mental models). Benjamin Franklin thought that innovation (or science) should be pursued for pure fascination, espousing the **Curiosity Tendency**. Management consulting guru Peter Drucker famously said: “There are two functions, and two functions only, of any business: innovation and marketing. Marketing and innovation produce results; all the rest are costs.” Drucker viewed innovation as the fuel of **Sustainability**. Eric Ries, credited with pioneering the Lean Startup movement, had similar thoughts: “I believe a company’s only **Sustainable** path to long-term economic growth is to build an innovation factory that uses Lean Startup techniques to create disruptive innovations on a continuous basis.” Robert Noyce, co-founder of Intel, asked only two questions in the earliest stages of scientific innovation: “Why won’t this work?” and “What fundamental laws will it violate?” Noyce embraced the idea of **Disconfirmation**.

Other minds have offered insight into the economics and organizational dynamics behind innovation. Clayton Christensen, of The Innovator’s Dilemma fame, has noticed that as soon as an established technology begins to improve at a decreasing rate, a new technology may emerge to supplant the existing one. Steve Jobs relayed the following thoughts on innovation in Walter Isaacson’s book about him: “I have my own theory why decline happens at companies like IBM or Microsoft. The company does a great job, innovates and becomes a monopoly or close to it in some field, and then the quality of the product becomes less important. The company starts valuing the great salesmen, because they’re the ones who can move the needle on revenues, not the product engineers and designers. So the salespeople end up running the company. John Akers at IBM was a smart, eloquent, fantastic salesperson, but he didn’t know anything about the product. The same thing happened at Xerox. When the sales guys run the company, the products guys don’t matter so much, and a lot of them just turn off. It happened at Apple when Sculley came in, which was my fault, and it happened when Ballmer took over at Microsoft. Apple was lucky and it rebounded, but I don’t think anything will change at Microsoft as long as Ballmer is running it.”

But the one guy who best captures the nature of innovation is Steven Berlin Johnson in *Where Good Ideas Come From: The Natural History of Innovation*. Johnson basically breaks down innovation into three steps: identification, environment, and proliferation.

**Identification:** Although we cannot predict innovation, we can use **Fractals** to spot it. When life gets creative, it has a tendency to gravitate toward certain recurring patterns. These patterns are fractal, reappearing in recognizable form, but at different scales. The patterns remain the same even as the scale of observation changes. And they scale up or down by a specific amount in accordance with a precise, measurable formula.

**Environment:** Innovative systems have a tendency to gravitate toward the “edge of chaos”: the fertile zone between too much order and too much complexity. In chemistry-speak, instead of a gas or a solid, innovation thrives in a liquid environment. This allows new configurations to emerge through random connections. Jobs designed the Apple office space along these lines, as he believed that creativity came from spontaneous meetings and random discussions. “You run into someone, you ask what they’re doing, you say ‘Wow,’ and soon you’re cooking up all sorts of ideas.” Carl F. Braun bought into this as well, subsidizing 1/3 of the cost of lunch at his in-house cafeteria. He knew he needed to find a way to keep his employees from different sections interacting, and this was his solution. In the *Act of Creation*, Arthur Koestler argued that “all decisive events in the history of scientific thought can be described in terms of mental cross-fertilization between different disciplines.” In other words, Charlie Munger’s concept of worldly wisdom, or a cursory understanding of many disciplines through the main concepts that govern them, is the overriding paradigm for effective innovation. Many of history’s great innovators managed to build a cross-disciplinary coffeehouse environment within their own private work routines. In fact, creativity and innovation are governed by the inverse of a quarter-power law known as **Kleiber’s Law**: as cities get bigger, they generate ideas at a faster clip. The more people, the more cross-functional interactions, the more innovation and creation.

**Proliferation:** **Quality Control** may be great in manufacturing, but innovative environments suffer under the demands of it. The secret to organizational inspiration is to build information networks that allow hunches to persist and disperse and recombine. Good ideas are more likely to emerge in environments that contain a certain amount of noise and error. Innovative environments thrive on useful mistakes.
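The scaling contrast Johnson draws (Kleiber’s sublinear quarter-power law versus superlinear idea generation in cities) can be sketched numerically. The populations and the 1.25 exponent below are illustrative assumptions, not measured values; the point is only that a sublinear power law grows slower than population, while a superlinear one grows faster.

```python
def power_law_output(population, exponent):
    """Output predicted by a simple power law: output ~ population ** exponent."""
    return population ** exponent

# Kleiber's law in biology is sublinear (metabolic rate ~ mass ** 0.75):
# doubling the organism less than doubles energy use. Johnson's point is
# that idea generation in cities runs the other way, scaling superlinearly.
small, large = 100_000, 200_000  # hypothetical city populations

sublinear = power_law_output(large, 0.75) / power_law_output(small, 0.75)
superlinear = power_law_output(large, 1.25) / power_law_output(small, 1.25)

print(f"2x population, sublinear (3/4):   {sublinear:.2f}x output")    # ~1.68x
print(f"2x population, superlinear (5/4): {superlinear:.2f}x output")  # ~2.38x
```

Doubling a city, on the superlinear curve, more than doubles its idea output: that compounding surplus is what “generate ideas at a faster clip” means in practice.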

Now, all organizations face **Constraints**. And all organizations face change. So the challenge for any organization is to foster an environment of innovation within those constraints. In the end, it is the job of business to convert change into innovation.

Unfortunately, there’s no consensus on what the ideal team size should be. This is probably because there just simply isn’t one. And that, of course, is fine. But everyone seems to have an opinion on what’s best.

Steve Jobs liked to keep his teams to no more than 100 people so that he could remember names; Peter Drucker said teams work best, as a rule, if they have three or four members (and should normally not exceed five or six); Google likes to limit teams to a max of six people; 37Signals thinks three people is the optimal team size for a product release; Reid Hoffman (of LinkedIn) would likely refer to Dunbar’s Number to substantiate groups of up to 150. And the list could go on… Does this mean that teams are effective at any size between three and 150 members? It’s more likely that this simply means teambuilding is a situational exercise, and nothing more.

Although teambuilding is situational, most studies of effective teamwork usually recommend working in groups of between three and eight people. The primary reason for smaller groups is to limit communication ‘overhead,’ or the proportion of time spent communicating with team members instead of getting productive work done. As communication overhead increases, the opportunity for solo time decreases, and this solo time can be critical in any process requiring creativity.

Small teams have requirements to be effective, though. Namely, precisely defined goals and easily monitored deadlines. Any team needs a clear and sharply defined objective and commitment towards achieving it. But in small teams, it must also be possible to **Feed Back** from the objectives to the work and performance of the whole team and of each member.

So although there are no absolutes for designing teams, the following concepts (or mental models) act as guides in the creative process:

**Metcalfe’s Law:** A corollary to Metcalfe’s Law is that the efficiency of a team is approximately the inverse of the square of the number of members in the team. In other words, a team of three people operates at 11% (1/9) of the efficiency that an individual would operate. This is simply a concept to keep in mind – it isn’t definite truth. Well-designed teams will obviously operate at levels higher than 11% – otherwise, they would never be created.

**Information Theory:** As the number of people on a project increases, so does the number of possible communication paths. And for every additional relay of information, the “message” is halved and the “noise” is doubled. So this simply means information relays ought to be limited, which can be addressed in the design phase of a team.

**Comparative Advantage:** When groups of people with diverse problem-solving skills put their heads together, they often outperform groups of the smartest individuals. Diversity often trumps ability. Comparative advantage explains why diverse teams consistently outperform homogeneous teams.

**Parkinson’s Law:** According to Cyril Parkinson, work expands so as to fill the time available for its completion. The larger a team, the more inherent lag in the system (in other words, the more wait time). Teams should be designed so as not to let work expand to fill the time available.

**Systems & Constraints:** A team is a system of individuals. Like any system, the team should be structured to optimize the system as a whole, and not maximize one particular individual. This is usually done by successively removing bottlenecks (or constraints) that inhibit the ability to complete a task.

**Bernoulli’s Theorem:** In any series of endeavors, the chances of succeeding are reduced by 50% with each successive attempt. This can be used as a Feedback mechanism for the quality of an assembled team. After three failed attempts with the same team, it’s likely time to disband.

**Arithmetic:** Very basic, I know. But elementary math ability is required here when looking at communication relays.
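The arithmetic behind a couple of these models is simple enough to sketch in a few lines. The 1/n² figure is the corollary quoted above; the pairwise-path count n(n-1)/2 is the standard complete-graph formula, which the text alludes to but does not state. Function names are mine.

```python
def communication_paths(n):
    """Pairwise communication paths in a team of n people: n * (n - 1) / 2."""
    return n * (n - 1) // 2

def metcalfe_corollary_efficiency(n):
    """Rough per-team efficiency from the corollary above: 1 / n**2."""
    return 1 / n ** 2

for size in (3, 6, 8, 20):
    print(f"team of {size:>2}: {communication_paths(size):>3} paths, "
          f"~{metcalfe_corollary_efficiency(size) * 100:.0f}% efficiency")
```

Note how quickly the path count grows relative to headcount – a team of 20 carries 190 possible communication paths against a team of three’s 3. That is the communication-overhead argument for small teams in a nutshell.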

These mental models will help anyone think through the creation of a team. And although there may not be definitive rules on the proper size of a team, the effectiveness of a team usually comes down to information flow. The faster that relevant information can flow, the more able organizations are to organize effective small teams. Rapid information flow encourages teamwork and allows an organization to reconfigure itself quickly.
