Statistica is now part of TIBCO. Beginning June 5th, Statistica customers will need to use TIBCO Support's customer portal, to receive assistance with Statistica.

# How To Predict Membership, Classification Trees

## Basic Ideas

Classification trees are used to predict membership of cases or objects in the classes of a categorical dependent variable from their measurements on one or more predictor variables. Classification tree analysis is one of the main techniques used in Data Mining.

The goal of classification trees is to predict or explain responses on a categorical dependent variable, and as such, the available techniques have much in common with the techniques used in the more traditional methods of Discriminant Analysis, Cluster Analysis, Nonparametric Statistics, and Nonlinear Estimation. The flexibility of classification trees make them a very attractive analysis option, but this is not to say that their use is recommended to the exclusion of more traditional methods. Indeed, when the typically more stringent theoretical and distributional assumptions of more traditional methods are met, the traditional methods may be preferable. But as an exploratory technique, or as a technique of last resort when traditional methods fail, classification trees are, in the opinion of many researchers, unsurpassed.

What are classification trees? Imagine that we want to devise a system for sorting a collection of coins into different classes (perhaps pennies, nickels, dimes, quarters). Suppose that there is a measurement on which the coins differ, say diameter, which can be used to devise a hierarchical system for sorting coins. We might roll the coins on edge down a narrow track in which a slot the diameter of a dime is cut. If the coin falls through the slot it is classified as a dime, otherwise it continues down the track to where a slot the diameter of a penny is cut. If the coin falls through the slot it is classified as a penny, otherwise it continues down the track to where a slot the diameter of a nickel is cut, and so on. We have just constructed a classification tree. The decision process used by our classification tree provides an efficient method for sorting a pile of coins, and more generally, can be applied to a wide variety of classification problems.

The study and use of classification trees are not widespread in the fields of probability and statistical pattern recognition (Ripley, 1996), but classification trees are widely used in applied fields as diverse as medicine (diagnosis), computer science (data structures), botany (classification), and psychology (decision theory). Classification trees readily lend themselves to being displayed graphically, helping to make them easier to interpret than they would be if only a strict numerical interpretation were possible.

Classification trees can be and sometimes are quite complex. However, graphical procedures can be developed to help simplify interpretation even for complex trees. If one's interest is mainly in the conditions that produce a particular class of response, perhaps a High response, a 3D Contour Plot can be produced to identify which terminal node of the classification tree classifies most of the cases with High responses.

In the example illustrated by this 3D Contour Plot, we could "follow the branches" leading to terminal node 8 to obtain an understanding of the conditions leading to High responses.

Amenability to graphical display and ease of interpretation are perhaps partly responsible for the popularity of classification trees in applied fields, but two features that characterize classification trees more generally are their hierarchical nature and their flexibility.

For information on techniques and issues in computing classification trees, see Computational Methods. See also Exploratory Data Analysis and Data Mining Techniques.

## Characteristics of Classification Trees

### Hierarchical Nature of Classification Trees

Breiman et al. (1984) give a number of examples of the use of classification trees. As one example, when heart attack patients are admitted to a hospital, dozens of tests are often performed to obtain physiological measures such as heart rate, blood pressure, and so on. A wide variety of other information is also obtained, such as the patient's age and medical history. Patients subsequently can be tracked to see if they survive the heart attack, say, at least 30 days. It would be useful in developing treatments for heart attack patients, and in advancing medical theory on heart failure, if measurements taken soon after hospital admission could be used to identify high-risk patients (those who are not likely to survive at least 30 days). One classification tree that Breiman et al. (1984) developed to address this problem was a simple, three question decision tree. Verbally, the binary classification tree can be described by the statement, "If the patient's minimum systolic blood pressure over the initial 24 hour period is greater than 91, then if the patient's age is over 62.5 years, then if the patient displays sinus tachycardia, then and only then the patient is predicted not to survive for at least 30 days." It is easy to conjure up the image of a decision "tree" from such a statement. A hierarchy of questions are asked and the final decision that is made depends on the answers to all the previous questions. Similarly, the relationship of a leaf to the tree on which it grows can be described by the hierarchy of splits of branches (starting from the trunk) leading to the last branch from which the leaf hangs. The hierarchical nature of classification trees is one of their most basic features (but the analogy with trees in nature should not be taken too far; most decision trees are drawn downward on paper, so the more exact analogy in nature would be a decision root system leading to the root tips, hardly a poetic image).

The hierarchical nature of classification trees is illustrated by a comparison to the decision-making procedure employed in Discriminant Analysis. A traditional linear discriminant analysis of the heart attack data would produce a set of coefficients defining the single linear combination of blood pressure, patient age, and sinus tachycardia measurements that best differentiates low risk from high risk patients. A score for each patient on the linear discriminant function would be computed as a composite of each patient's measurements on the three predictor variables, weighted by the respective discriminant function coefficients. The predicted classification of each patient as a low risk or a high risk patient would be made by simultaneously considering the patient's scores on the three predictor variables. That is, suppose P (minimum systolic blood Pressure over the 24 hour period), A (Age in years), and T (presence of sinus Tachycardia: 0 = not present; 1 = present) are the predictor variables, p, a, and t, are the corresponding linear discriminant function coefficients, and c is the "cut point" on the discriminant function for separating the two classes of heart attack patients. The decision equation for each patient would be of the form, "if pP + aA + tT - c is less than or equal to zero, the patient is low risk, else the patient is in high risk."

In comparison, the decision tree developed by Breiman et al. (1984) would have the following hierarchical form, where p, a, and t would be -91, -62.5, and 0, respectively, "If p + P is less than or equal to zero, the patient is low risk, else if a + A is less than or equal to zero, the patient is low risk, else if t + T is less than or equal to zero, the patient is low risk, else the patient is high risk." Superficially, the Discriminant Analysis and classification tree decision processes might appear similar, because both involve coefficients and decision equations. But the difference of the simultaneous decisions of Discriminant Analysis from the hierarchical decisions of classification trees cannot be emphasized enough.

The distinction between the two approaches can perhaps be made most clear by considering how each analysis would be performed in Regression. Because risk in the example of Breiman et al. (1984) is a dichotomous dependent variable, the Discriminant Analysis predictions could be reproduced by a simultaneous multiple regression of risk on the three predictor variables for all patients. The classification tree predictions could only be reproduced by three separate simple regression analyses, where risk is first regressed on P for all patients, then risk is regressed on A for patients not classified as low risk in the first regression, and finally, risk is regressed on T for patients not classified as low risk in the second regression. This clearly illustrates the simultaneous nature of Discriminant Analysis decisions as compared to the recursive, hierarchical nature of classification trees decisions, a characteristic of classification trees that has far-reaching implications.

### Flexibility of Classification Trees

Another distinctive characteristic of classification trees is their flexibility. The ability of classification trees to examine the effects of the predictor variables one at a time, rather than just all at once, has already been described, but there are a number of other ways in which classification trees are more flexible than traditional analyses. The ability of classification trees to perform univariate splits, examining the effects of predictors one at a time, has implications for the variety of types of predictors that can be analyzed. In the Breiman et al. (1984) heart attack example, blood pressure and age were continuous predictors, but presence of sinus tachycardia was a categorical (two-level) predictor. Even if sinus tachycardia was measured as a three-level categorical predictor (perhaps coded as 0 = not present; 1 = present; 3 = unknown or unsure), without any underlying continuous dimension represented by the values assigned to its levels, univariate splits on the predictor variables could still be easily performed. Additional decisions would be added to the decision tree to exploit any additional information on risk provided by the additional category. To summarize, classification trees can be computed for categorical predictors, continuous predictors, or any mix of the two types of predictors when univariate splits are used.

Traditional linear discriminant analysis requires that the predictor variables be measured on at least an interval scale. For classification trees based on univariate splits for ordinal scale predictor variables, it is interesting that any monotonic transformation of the predictor variables (i.e., any transformation that preserves the order of values on the variable) will produce splits yielding the same predicted classes for the cases or objects (if the C&RT-style univariate split selection method is used, see Breiman et al., 1984). Therefore, classification trees based on univariate splits can be computed without concern for whether a unit change on a continuous predictor represents a unit change on the dimension underlying the values on the predictor variable; it need only be assumed that predictors are measured on at least an ordinal scale. In short, assumptions regarding the level of measurement of predictor variables are less stringent.

Classification trees are not limited to univariate splits on the predictor variables. When continuous predictors are indeed measured on at least an interval scale, linear combination splits, similar to the splits for linear discriminant analysis, can be computed for classification trees. However, the linear combination splits computed for Classification Trees do differ in important ways from the linear combination splits computed for Discriminant Analysis. In linear discriminant analysis the number of linear discriminant functions that can be extracted is the lesser of the number of predictor variables or the number of classes on the dependent variable minus one. The recursive approach implemented for Classification Trees module does not face this limitation. For example, dozens of recursive, linear combination splits potentially could be performed when there are dozens of predictor variables but only two classes on the dependent variable. This compares with the single linear combination split that could be performed using traditional, non-recursive linear discriminant analysis, which could leave a substantial amount of the information in the predictor variables unused.

Now consider the situation in which there are many categories but few predictors. Suppose we were trying to sort coins into classes (perhaps pennies, nickels, dimes, and quarters) based only on thickness and diameter measurements. Using traditional linear discriminant analysis, at most two linear discriminant functions could be extracted, and the coins could be successfully sorted only if there were no more than two dimensions represented by linear combinations of thickness and diameter on which the coins differ. Again, the approach implemented for Classification Trees does not face a limitation on the number of linear combination splits that can be formed.

The approach implemented for Classification Trees for linear combination splits can also be used as the analysis method for constructing classification trees using univariate splits. Actually, a univariate split is just a special case of a linear combination split. Imagine a linear combination split in which the coefficients for creating the weighted composite were zero for all predictor variables except one. Since scores on the weighted composite would depend only on the scores on the one predictor variable with the nonzero coefficient, the resulting split would be a univariate split.

The approach implemented for Classification Trees for the Discriminant-based univariate split selection method for categorical and ordered predictors and for the Discriminant-based linear combination split selection method for ordered predictors is an adaption of the algorithms used in QUEST (Quick, Unbiased, Efficient Statistical Trees). QUEST is a classification tree program developed by Loh and Shih (1997) that employs a modification of recursive quadratic discriminant analysis and includes a number of innovative features for improving the reliability and efficiency of the classification trees that it computes.

The algorithms used in QUEST are fairly technical, but the Classification Trees module also offers a Split selection method option based on a conceptually simpler approach. The C&RT-style univariate split selection method is an adaption of the algorithms used in C&RT, as described by Breiman et al. (1984). C&RT (Classification And Regression Trees) is a classification tree program that uses an exhaustive grid search of all possible univariate splits to find the splits for a classification tree.

The QUEST and C&RT analysis options complement each other nicely. C&RT searches can be lengthy when there are a large number of predictor variables with many levels, and it is biased toward choosing predictor variables with more levels for splits, but because it employs an exhaustive search, it is guaranteed to find the splits producing the best classification (in the learning sample, but not necessarily in cross-validation samples).

QUEST is fast and unbiased. The speed advantage of QUEST over C&RT is particularly dramatic when the predictor variables have dozens of levels (Loh & Shih, 1997, report an analysis completed by QUEST in 1 CPU second that took C&RT 30.5 CPU hours to complete). QUEST's lack of bias in variable selection for splits is also a distinct advantage when some predictor variable have few levels and other predictor variables have many levels (predictors with many levels are more likely to produce "fluke theories," which fit the data well but have low predictive accuracy, see Doyle, 1973, and Quinlan & Cameron-Jones, 1995). Finally, QUEST does not sacrifice predictive accuracy for speed (Lim, Loh, & Shih, 1997). Together, the QUEST and C&RT options enable us to fully exploit the flexibility of classification trees.

### The Power and Pitfalls of Classification Trees

The advantages of classification trees over traditional methods such as linear discriminant analysis, at least in some applications, can be illustrated using a simple, fictitious data set. To keep the presentation even-handed, other situations in which linear discriminant analysis would outperform classification trees are illustrated using a second data set.

Suppose we have records of the Longitude and Latitude coordinates at which 37 storms reached hurricane strength for two classifications of hurricanes - Baro hurricanes and Trop hurricanes. The fictitious data shown below were presented for illustrative purposes by Elsner, Lehmiller, and Kimberlain (1996), who investigated the differences between baroclinic and tropical North Atlantic hurricanes.

DATA: Barotrop.sta 3v
LONGITUD LATITUDE CLASS
59.00
59.50
60.00
60.50
61.00
61.00
61.50
61.50
62.00
63.00
63.50
64.00
64.50
65.00
65.00
65.00
65.50
65.50
65.50
66.00
66.00
66.00
66.50
66.50
66.50
67.00
67.50
68.00
68.50
69.00
69.00
69.50
69.50
70.00
70.50
71.00
71.50
17.00
21.00
12.00
16.00
13.00
15.00
17.00
19.00
14.00
15.00
19.00
12.00
16.00
12.00
15.00
17.00
16.00
19.00
21.00
13.00
14.00
17.00
17.00
18.00
21.00
14.00
18.00
14.00
18.00
13.00
15.00
17.00
19.00
12.00
16.00
17.00
21.00
BARO
BARO
BARO
BARO
BARO
BARO
BARO
BARO
BARO
TROP
TROP
TROP
TROP
TROP
TROP
TROP
TROP
TROP
TROP
TROP
TROP
TROP
TROP
TROP
TROP
TROP
TROP
BARO
BARO
BARO
BARO
BARO
BARO
BARO
BARO
BARO
BARO

A linear discriminant analysis of hurricane Class (Baro or Trop) using Longitude and Latitude as predictors correctly classifies only 20 of the 37 hurricanes (54%). A classification tree for Class using the C&RT-style exhaustive search for univariate splits option correctly classifies all 37 hurricanes. The tree graph for the classification tree is shown below.

The headings of the graph give the summary information that the classification tree has 2 splits and 3 terminal nodes. Terminal nodes, or terminal leaves as they are sometimes called, are points on the tree beyond which no further decisions are made. In the graph itself, terminal nodes are outlined with dotted red lines, while the remaining decision nodes or split nodes are outlined with solid black lines. The tree starts with the top decision node, sometimes called the root node. In the graph it is labeled as node 1 in its top-left corner. Initially, all 37 hurricanes are assigned to the root node and tentatively classified as Baro hurricanes, as indicated by the Baro label in the top-right corner of the root node. Baro is chosen as the initial classification because there are slightly more Baro than Trop hurricanes, as indicated by the histogram plotted within the root node. The legend identifying which bars in the node histograms correspond to Baro and Trop hurricanes is located in the top-left corner of the graph.

The root node is split, forming two new nodes. The text below the root node describes the split. It indicates that hurricanes with Longitude coordinate values of less than or equal to 67.75 are sent to node number 2 and tentatively classified as Trop hurricanes, and that hurricanes with Longitude coordinate values of greater than 67.75 are assigned to node number 3 and classified as Baro hurricanes. The values of 27 and 10 printed above nodes 2 and 3, respectively, indicate the number of cases sent to each of these two child nodes from their parent, the root node. Similarly, node 2 is subsequently split. The split is such that the 9 hurricanes with Longitude coordinate values of less than or equal to 62.5 are sent to node number 4 and classified as Baro hurricanes, and the remaining 18 hurricanes with Longitude coordinate values of greater than 62.5 are sent to node number 5 and classified as Trop hurricanes.

The tree graph presents all this information in a simple, straightforward way, and probably allows us to digest the information in much less time than it takes to read the two preceding paragraphs. Getting to the bottom line, the histograms plotted within the tree's terminal nodes show that the classification tree classifies the hurricanes perfectly. Each of the terminal nodes is "pure," containing no misclassified hurricanes. All the information in the tree graph is also available in the tree structure spreadsheet shown below.

Tree Structure (barotrop.sta)
CLASSIF.
TREES
Child nodes, observed class n's,
predicted class, and split condition for each node

Node
Left
branch
Right
branch
n in cls
BARO
n in cls
TROP
Predict.
class
Split
constant
Split
variable
1
2
3
4
5
2
4

3
5

19
9
10
9
0
18
18
0
0
18
BARO
TROP
BARO
BARO
TROP
-67.75
-62.50

LONGITUD
LONGITUD

Note that in the spreadsheet, nodes 3 through 5 are identified as terminal nodes because no split is performed at those nodes. Also note the signs of the Split constants displayed in the spreadsheet, for example, -67.75 for the split at node 1. In the tree graph, the split condition at node 1 is described as LONGITUD 67.75 rather than as (the equivalent) -67.75 + LONGITUD 0. This is done simply to save space on the graph.

When univariate splits are performed, the predictor variables can be ranked on a 0 - 100 scale in terms of their potential importance in accounting for responses on the dependent variable. For this example, Longitude is clearly very important and Latitude is relatively unimportant.

A classification tree Class using the Discriminant-based univariate split selection method option produces similar results. The Tree structure spreadsheet shown for this analysis shows that the splits of -63.4716 and -67.7516 are quite similar to the splits found using the C&RT-style exhaustive search for univariate splits option, although 1 Trop hurricane in terminal node 2 is misclassified as Baro.

Tree Structure (barotrop.sta)
CLASSIF.
TREES
Child nodes, observed class n's,
predicted class, and split condition for each node

Node
Left
branch
Right
branch
n in cls
BARO
n in cls
TROP
Predict.
class
Split
constant
Split
variable
1
2
3
4
5
2

4

3

5

19
9
10
0
10
18
1
17
17
0
BARO
BARO
TROP
TROP
BARO
-63.4716

-67.7516

LONGITUD

LONGITUD

A categorized scatterplot for Longitude and Latitude clearly shows why linear discriminant analysis fails so miserably at predicting Class, and why the classification tree succeeds so well.

The plot clearly shows that there is no strong linear relationship of longitude or latitude coordinates with Class, or of any possible linear combination of longitude and latitude with Class. Class is not functionally related to longitude or latitude, at least in the linear sense. The LDF (Linear Discriminant Function) Split shown on the graph is almost a "shot in the dark" at trying to separate predicted Trop hurricanes (above the split line) from predicted Baro hurricanes (below the split line). The C&RT univariate splits, because they are not restricted to a single linear combination of longitude and latitude scores, find the "cut points" on the Longitude dimension that allow the best possible (in this case, perfect) classification of hurricane Class.

Now we can examine a situation illustrating the pitfalls of classification tree. Suppose that the following hurricane data were available.

DATA: Barotro2.sta 3v
LONGITUD LATITUDE CLASS
59.00
59.50
60.00
60.50
61.00
61.00
61.50
61.50
62.00
63.00
63.50
64.00
64.50
65.00
65.00
65.00
65.50
65.50
65.50
66.00
66.00
66.00
66.50
66.50
66.50
67.00
67.50
68.00
68.50
69.00
69.00
69.50
69.50
70.00
70.50
71.00
71.50
17.00
21.00
12.00
16.00
13.00
15.00
17.00
19.00
14.00
15.00
19.00
12.00
16.00
12.00
15.00
17.00
16.00
19.00
21.00
13.00
14.00
17.00
17.00
18.00
21.00
14.00
18.00
14.00
18.00
13.00
15.00
17.00
19.00
12.00
16.00
17.00
21.00
BARO
BARO
TROP
BARO
TROP
TROP
BARO
BARO
TROP
TROP
BARO
TROP
TROP
TROP
TROP
BARO
TROP
BARO
BARO
TROP
TROP
BARO
BARO
BARO
BARO
TROP
BARO
TROP
BARO
TROP
TROP
TROP
BARO
TROP
TROP
TROP
BARO

A linear discriminant analysis of hurricane Class (Baro or Trop) using Longitude and Latitude as predictors correctly classifies all 37 of the hurricanes. A classification tree analysis for Class using the C&RT-style exhaustive search for univariate splits option also correctly classifies all 37 hurricanes, but the tree requires 5 splits producing 6 terminal nodes. Which results are easier to interpret? In the linear discriminant analysis, the raw canonical discriminant function coefficients for Longitude and Latitude on the (single) discriminant function are .122073 and -.633124, respectively, and hurricanes with higher longitude and lower latitude coordinates are classified as Trop. The interpretation would be that hurricanes in the western Atlantic at low latitudes are likely to be Trop hurricanes, and that hurricanes further east in the Atlantic at higher latitudes are likely to be Baro hurricanes.

The tree graph for the classification tree analysis using the C&RT-style exhaustive search for univariate splits option is shown below.

We could methodically describe the splits in this classification tree, exactly as was done in the previous example, but because there are so many splits, the interpretation would necessarily be more complex than the simple interpretation provided by the single discriminant function from the linear discrimination analysis.

However, recall that in describing the flexibility of Classification Trees , it was noted that an option exists for Discriminant-based linear combination splits for ordered predictors using algorithms from QUEST. The tree graph for the classification tree analysis using linear combination splits is shown below.

Note that in this tree, just one split yields perfect prediction. Each of the terminal nodes is "pure," containing no misclassified hurricanes. The linear combination split used to split the root node into its left child node and right child node is summarized by the description "F(0) -.2342." This indicates that if a hurricane has a score of less than or equal to -.2342 on the split function - abbreviated as F(0) - then it is sent to the left child node and classified as Baro, otherwise it is sent to the right child node and classified as Trop. The split function coefficients (.011741 for Longitude and -.060896 for Latitude) have the same signs and are similar in their relative magnitude to the corresponding linear discriminant function coefficients from the linear discriminant analysis, so the two analyses are functionally identical, at least in terms of their predictions of hurricane Class.

The moral of this story of the power and pitfalls of classification trees is that classification trees are only as good as the choice of analysis option used to produce them. For finding models that predict well, there is no substitute for a thorough understanding of the nature of the relationships between the predictor and dependent variables.

We have seen that classification trees analysis can be characterized as a hierarchical, highly flexible set of techniques for predicting membership of cases or objects in the classes of a categorical dependent variable from their measurements on one or more predictor variables. With this groundwork behind us, we now are ready to look at the methods for computing classification trees in greater detail.

For information on the basic purpose of classification trees, see Basic Ideas. See also, Exploratory Data Analysis and Data Mining Techniques.

## Computational Methods

The process of computing classification trees can be characterized as involving four basic steps:

### Specifying the Criteria for Predictive Accuracy

The goal of classification tree analysis, simply stated, is to obtain the most accurate prediction possible. Unfortunately, an operational definition of accurate prediction is hard to come by. To solve the problem of defining predictive accuracy, the problem is "stood on its head," and the most accurate prediction is operationally defined as the prediction with the minimum costs. The term costs need not seem mystifying. In many typical applications, costs simply correspond to the proportion of misclassified cases. The notion of costs was developed as a way to generalize, to a broader range of prediction situations, the idea that the best prediction has the lowest misclassification rate.

The need for minimizing costs, rather than just the proportion of misclassified cases, arises when some predictions that fail are more catastrophic than others, or when some predictions that fail occur more frequently than others. The costs to a gambler of losing a single bet (or prediction) on which the gambler's whole fortune is at stake are greater than the costs of losing many bets (or predictions) on which a tiny part of the gambler's fortune is at stake. Conversely, the costs of losing many small bets can be larger than the costs of losing just a few bigger bets. We should spend proportionately more effort in minimizing losses on bets where losing (making errors in prediction) costs us more.

Priors. Minimizing costs, however, does correspond to minimizing the proportion of misclassified cases when Priors are taken to be proportional to the class sizes and when Misclassification costs are taken to be equal for every class. We will address Priors first. Priors, or, a priori probabilities, specify how likely it is, without using any prior knowledge of the values for the predictor variables in the model, that a case or object will fall into one of the classes. For example, in an educational study of high school drop-outs, it may happen that, overall, there are fewer drop-outs than students who stay in school (i.e., there are different base rates); thus, the a priori probability that a student drops out is lower than that a student remains in school.

The a priori probabilities used in minimizing costs can greatly affect the classification of cases or objects. If differential base rates are not of interest for the study, or if we know that there are about an equal number of cases in each class, then we would use equal priors. If the differential base rates are reflected in the class sizes (as they would be if the sample is a probability sample) then we would use priors estimated by the class proportions of the sample. Finally, if we have specific knowledge about the base rates (for example, based on previous research), then we would specify priors in accordance with that knowledge. For example, a priori probabilities for carriers of a recessive gene could be specified as twice as high as for individuals who display a disorder caused by the recessive gene. The general point is that the relative size of the priors assigned to each class can be used to "adjust" the importance of misclassifications for each class. Minimizing costs corresponds to minimizing the overall proportion of misclassified cases when Priors are taken to be proportional to the class sizes (and Misclassification costs are taken to be equal for every class), because prediction should be better in larger classes to produce an overall lower misclassification rate.

Misclassification costs. Sometimes more accurate classification is desired for some classes than others for reasons unrelated to relative class sizes. Regardless of their relative frequency, carriers of a disease who are contagious to others might need to be more accurately predicted than carriers of the disease who are not contagious to others. If we assume that little is lost in avoiding a non-contagious person but much is lost in not avoiding a contagious person, higher misclassification costs could be specified for misclassifying a contagious carrier as non-contagious than for misclassifying a non-contagious person as contagious. But to reiterate, minimizing costs corresponds to minimizing the proportion of misclassified cases when Priors are taken to be proportional to the class sizes and when Misclassification costs are taken to be equal for every class.

Case weights. A little less conceptually, the use of case weights on a weighting variable as case multipliers for aggregated data sets is also related to the issue of minimizing costs. Interestingly, as an alternative to using case weights for aggregated data sets, we could specify appropriate priors and/or misclassification costs and produce the same results while avoiding the additional processing required to analyze multiple cases with the same values for all variables. Suppose that in an aggregated data set with two classes having an equal number of cases, there are case weights of 2 for all the cases in the first class, and case weights of 3 for all the cases in the second class. If we specify priors of .4 and .6, respectively, specify equal misclassification costs, and analyze the data without case weights, we will get the same misclassification rates as we would get if we specify priors estimated by the class sizes, specify equal misclassification costs, and analyze the aggregated data set using the case weights. We would also get the same misclassification rates if we specify priors to be equal, specify the costs of misclassifying class 1 cases as class 2 cases to be 2/3 of the costs of misclassifying class 2 cases as class 1 cases, and analyze the data without case weights.

The relationships between priors, misclassification costs, and case weights become quite complex in all but the simplest situations (for discussions, see Breiman et al, 1984; Ripley, 1996). In analyses where minimizing costs corresponds to minimizing the misclassification rate, however, these issues need not cause any concern. Priors, misclassification costs, and case weights are brought up here, however, to illustrate the wide variety of prediction situations that can be handled using the concept of minimizing costs, as compared to the rather limited (but probably typical) prediction situations that can be handled using the narrower (but simpler) idea of minimizing misclassification rates. Furthermore, minimizing costs is an underlying goal of classification tree analysis, and is explicitly addressed in the fourth and final basic step in classification tree analysis, where in trying to select the "right-sized" tree, we choose the tree with the minimum estimated costs. Depending on the type of prediction problem we are trying to solve, understanding the idea of reduction of estimated costs may be important for understanding the results of the analysis.

### Selecting Splits

The second basic step in classification tree analysis is to select the splits on the predictor variables that are used to predict membership in the classes of the dependent variables for the cases or objects in the analysis. Not surprisingly, given the hierarchical nature of classification trees, these splits are selected one at time, starting with the split at the root node, and continuing with splits of resulting child nodes until splitting stops, and the child nodes that have not been split become terminal nodes. Three Split selection methods are discussed here.

Discriminant-based univariate splits. The first step in split selection when the Discriminant-based univariate splits option is chosen is to determine the best terminal node to split in the current tree, and which predictor variable to use to perform the split. For each terminal node, p-values are computed for tests of the significance of the relationship of class membership with the levels of each predictor variable. For categorical predictors, the p-values are computed for Chi-square tests of independence of the classes and the levels of the categorical predictor that are present at the node. For ordered predictors, the p-values are computed for ANOVAs of the relationship of the classes to the values of the ordered predictor that are present at the node. If the smallest computed p-value is smaller than the default Bonferroni-adjusted p-value for multiple comparisons of .05 (a different threshold value can be used), the predictor variable producing that smallest p-value is chosen to split the corresponding node. If no p-value smaller than the threshold p-value is found, p-values are computed for statistical tests that are robust to distributional violations, such as Levene's F. Details concerning node and predictor variable selection when no p-value is smaller than the specified threshold are described in Loh and Shih (1997).

The next step is to determine the split. For ordered predictors, the 2-means clustering algorithm of Hartigan and Wong (1979, see also Cluster Analysis) is applied to create two "superclasses" for the node. The two roots are found for a quadratic equation describing the difference in the means of the "superclasses" on the ordered predictor, and the values for a split corresponding to each root are computed. The split closest to a "superclass" mean is selected. For categorical predictors, dummy-coded variables representing the levels of the categorical predictor are constructed, and then singular value decomposition methods are applied to transform the dummy-coded variables into a set of non-redundant ordered predictors. The procedures for ordered predictors are then applied and the obtained split is "mapped back" onto the original levels of the categorical variable and represented as a contrast between two sets of levels of the categorical variable. Again, further details about these procedures are described in Loh and Shih (1997). Although complicated, these procedures reduce a bias in split selection that occurs when using the C&RT-style exhaustive search method for selecting splits. This is the bias toward selecting variables with more levels for splits, a bias that can skew the interpretation of the relative importance of the predictors in explaining responses on the dependent variable (Breiman et. al., 1984).

Discriminant-based linear combination splits. The second split selection method is the Discriminant-based linear combination split option for ordered predictor variables (however, the predictors are assumed to be measured on at least interval scales). Surprisingly, this method works by treating the continuous predictors from which linear combinations are formed in a manner that is similar to the way categorical predictors are treated in the previous method. Singular value decomposition methods are used to transform the continuous predictors into a new set of non-redundant predictors. The procedures for creating "superclasses" and finding the split closest to a "superclass" mean are then applied, and the results are "mapped back" onto the original continuous predictors and represented as a univariate split on a linear combination of predictor variables.

C&RT-style exhaustive search for univariate splits. The third split-selection method is the C&RT-style exhaustive search for univariate splits method for categorical or ordered predictor variables. With this method, all possible splits for each predictor variable at each node are examined to find the split producing the largest improvement in goodness of fit (or equivalently, the largest reduction in lack of fit). What determines the domain of possible splits at a node? For categorical predictor variables with k levels present at a node, there are 2(k-1) - 1 possible contrasts between two sets of levels of the predictor. For ordered predictors with k distinct levels present at a node, there are k -1 midpoints between distinct levels. Thus it can be seen that the number of possible splits that must be examined can become very large when there are large numbers of predictors with many levels that must be examined at many nodes.

How is improvement in goodness of fit determined? Three choices of Goodness of fit measures are discussed here. The Gini measure of node impurity is a measure that reaches a value of zero when only one class is present at a node (with priors estimated from class sizes and equal misclassification costs, the Gini measure is computed as the sum of products of all pairs of class proportions for classes present at the node; it reaches its maximum value when class sizes at the node are equal). The Gini measure was the measure of goodness of fit preferred by the developers of C&RT (Breiman et. al., 1984). The two other indices are the Chi-square measure, which is similar to Bartlett's Chi-square (Bartlett, 1948), and the G-square measure, which is similar to the maximum-likelihood Chi-square used in structural equation modeling. The C&RT-style exhaustive search for univariate splits method works by searching for the split that maximizes the reduction in the value of the selected goodness of fit measure. When the fit is perfect, classification is perfect.

### Determining When to Stop Splitting

The third step in classification tree analysis is to determine when to stop splitting. One characteristic of classification trees is that if no limit is placed on the number of splits that are performed, eventually "pure" classification will be achieved, with each terminal node containing only one class of cases or objects. However, "pure" classification is usually unrealistic. Even a simple classification tree such as a coin sorter can produce impure classifications for coins whose sizes are distorted or if wear changes the lengths of the slots cut in the track. This potentially could be remedied by further sorting of the coins that fall into each slot, but to be practical, at some point the sorting would have to stop and we would have to accept that the coins have been reasonably well sorted.

Likewise, if the observed classifications on the dependent variable or the levels on the predicted variable in a classification tree analysis are measured with error or contain "noise," it is unrealistic to continue to sort until every terminal node is "pure." Two options for controlling when splitting stops will be discussed here. These two options are linked to the choice of the Stopping rule specified for the analysis.

Minimum n. One option for controlling when splitting stops is to allow splitting to continue until all terminal nodes are pure or contain no more than a specified minimum number of cases or objects. The desired minimum number of cases can be specified as the Minimum n, and splitting will stop when all terminal nodes containing more than one class have no more than the specified number of cases or objects.

Fraction of objects. Another option for controlling when splitting stops is to allow splitting to continue until all terminal nodes are pure or contain no more cases than a specified minimum fraction of the sizes of one or more classes. The desired minimum fraction can be specified as the Fraction of objects and, if the priors used in the analysis are equal and class sizes are equal, splitting will stop when all terminal nodes containing more than one class have no more cases than the specified fraction of the class sizes for one or more classes. If the priors used in the analysis are not equal, splitting will stop when all terminal nodes containing more than one class have no more cases than the specified fraction for one or more classes.

### Selecting the "Right-Sized" Tree

After a night at the horse track, a studious gambler computes a huge classification tree with numerous splits that perfectly account for the win, place, show, and no show results for every horse in every race. Expecting to become rich, the gambler takes a copy of the tree graph to the races the next night, sorts the horses racing that night using the classification tree, makes his or her predictions and places his or her bets, and leaves the race track later much less rich than had been expected. The poor gambler has foolishly assumed that a classification tree computed from a learning sample in which the outcomes are already known will perform equally well in predicting outcomes in a second, independent test sample. The gambler's classification tree performed poorly during cross-validation. The gambler's payoff might have been larger using a smaller classification tree that did not classify perfectly in the learning sample, but which was expected to predict equally well in the test sample. .

Some generalizations can be offered about what constitutes the "right-sized" classification tree. It should be sufficiently complex to account for the known facts, but at the same time it should be as simple as possible. It should exploit information that increases predictive accuracy and ignore information that does not. It should, if possible, lead to greater understanding of the phenomena that it describes. Of course, these same characteristics apply to any scientific theory, so we must try to be more specific about what constitutes the "right-sized" classification tree. One strategy is to grow the tree to just the right size, where the right size is determined by the user from knowledge from previous research, diagnostic information from previous analyses, or even intuition. The other strategy is to use a set of well-documented, structured procedures developed by Breiman et al. (1984) for selecting the "right-sized" tree. These procedures are not foolproof, as Breiman et al. (1984) readily acknowledge, but at least they take subjective judgment out of the process of selecting the "right-sized" tree.

FACT-style direct stopping. We will begin by describing the first strategy, in which the researcher specifies the size to grow the classification tree. This strategy is followed by using FACT-style direct stopping as the Stopping rule for the analysis and by specifying the Fraction of objects, which allows the tree to grow to the desired size. There are several options for obtaining diagnostic information to determine the reasonableness of the choice of size for the tree. Three options for performing cross-validation of the selected classification tree are discussed below.

Test sample cross-validation. The first, and most preferred type of cross-validation is test sample cross-validation. In this type of cross-validation, the classification tree is computed from the learning sample, and its predictive accuracy is tested by applying it to predict class membership in the test sample. If the costs for the test sample exceed the costs for the learning sample (remember, costs equal the proportion of misclassified cases when priors are estimated and misclassification costs are equal), this indicates poor cross-validation and that a different sized tree might cross-validate better. The test and learning samples can be formed by collecting two independent data sets, or if a large learning sample is available, by reserving a randomly selected proportion of the cases, say a third or a half, for use as the test sample.

V-fold cross-validation. This type of cross-validation is useful when no test sample is available and the learning sample is too small to have the test sample taken from it. A specified V value for V-fold cross-validation determines the number of random subsamples, as equal in size as possible, that are formed from the learning sample. The classification tree of the specified size is computed V times, each time leaving out one of the subsamples from the computations, and using that subsample as a test sample for cross-validation, so that each subsample is used V - 1 times in the learning sample and just once as the test sample. The CV costs computed for each of the V test samples are then averaged to give the V-fold estimate of the CV costs.

Global cross-validation. In global cross-validation, the entire analysis is replicated a specified number of times holding out a fraction of the learning sample equal to 1 over the specified number of times, and using each hold-out sample in turn as a test sample to cross-validate the selected classification tree. This type of cross-validation is probably no more useful than V-fold cross-validation when FACT-style direct stopping is used, but can be quite useful as a method validation procedure when automatic tree selection techniques are used (for discussion, see Breiman et. al., 1984). This brings us to the second of the two strategies that can used to select the "right-sized" tree, an automatic tree selection method based on a technique developed by Breiman et al. (1984) called minimal cost-complexity cross-validation pruning.

Minimal cost-complexity cross-validation pruning. Two methods of pruning can be used depending on the Stopping Rule we choose to use. Minimal cost-complexity cross-validation pruning is performed when we decide to Prune on misclassification error (as a Stopping rule), and minimal deviance-complexity cross-validation pruning is performed when we choose to Prune on deviance (as a Stopping rule). The only difference in the two options is the measure of prediction error that is used. Prune on misclassification error uses the costs that we have discussed repeatedly (which equal the misclassification rate when priors are estimated and misclassification costs are equal). Prune on deviance uses a measure, based on maximum-likelihood principles, called the deviance (see Ripley, 1996). We will focus on cost-complexity cross-validation pruning (as originated by Breiman et. al., 1984), since deviance-complexity pruning merely involves a different measure of prediction error.

The costs needed to perform cost-complexity pruning are computed as the tree is being grown, starting with the split at the root node up to its maximum size, as determined by the specified Minimum n. The learning sample costs are computed as each split is added to the tree, so that a sequence of generally decreasing costs (reflecting better classification) are obtained corresponding to the number of splits in the tree. The learning sample costs are called resubstitution costs to distinguish them from CV costs, because V-fold cross-validation is also performed as each split is added to the tree. Use the estimated CV costs from V-fold cross-validation as the costs for the root node. Note that tree size can be taken to be the number of terminal nodes, because for binary trees the tree size starts at one (the root node) and increases by one with each added split. Now, define a parameter called the complexity parameter whose initial value is zero, and for every tree (including the first, containing only the root node), compute the value for a function defined as the costs for the tree plus the complexity parameter times the tree size. Increase the complexity parameter continuously until the value of the function for the largest tree exceeds the value of the function for a smaller-sized tree. Take the smaller-sized tree to be the new largest tree, continue increasing the complexity parameter continuously until the value of the function for the largest tree exceeds the value of the function for a smaller-sized tree, and continue the process until the root node is the largest tree. (Those who are familiar with numerical analysis will recognize the use of a penalty function in this algorithm. The function is a linear combination of costs, which generally decrease with tree size, and tree size, which increases linearly. As the complexity parameter is increased, larger trees are penalized for their complexity more and more, until a discrete threshold is reached at which a smaller-sized tree's higher costs are outweighed by the largest tree's higher complexity)

The sequence of largest trees obtained by this algorithm have a number of interesting properties. They are nested, because successively pruned trees contain all the nodes of the next smaller tree in the sequence. Initially, many nodes are often pruned going from one tree to the next smaller tree in the sequence, but fewer nodes tend to be pruned as the root node is approached. The sequence of largest trees is also optimally pruned, because for every size of tree in the sequence, there is no other tree of the same size with lower costs. Proofs and/or explanations of these properties can be found in Breiman et al. (1984).

Tree selection after pruning. We now select the "right-sized" tree from the sequence of optimally pruned trees. A natural criterion is the CV costs. While there is nothing wrong with choosing the tree with the minimum CV costs as the "right-sized" tree, oftentimes there will be several trees with CV costs close to the minimum. Breiman et al. (1984) make the reasonable suggestion that we should choose as the "right-sized" tree the smallest-sized (least complex) tree whose CV costs do not differ appreciably from the minimum CV costs. They proposed a "1 SE rule" for making this selection, i.e., choose as the "right-sized" tree the smallest-sized tree whose CV costs do not exceed the minimum CV costs plus 1 times the Standard error of the CV costs for the minimum CV costs tree.

One distinct advantage of the "automatic" tree selection procedure is that it helps to avoid "overfitting" and "underfitting" of the data. The graph below shows a typical plot of the Resubstitution costs and CV costs for the sequence of successively pruned trees.

As shown in this graph, the Resubstitution costs (e.g., the misclassification rate in the learning sample) rather consistently decrease as tree size increases. The CV costs, on the other hand, approach the minimum quickly as tree size initially increases, but actually start to rise as tree size becomes very large. Note that the selected "right-sized" tree is close to the inflection point in the curve, that is, close to the point where the initial sharp drop in CV costs with increased tree size starts to level out. The "automatic" tree selection procedure is designed to select the simplest (smallest) tree with close to minimum CV costs, and thereby avoid the loss in predictive accuracy produced by "underfitting" or "overfitting" the data (note the similarity to the logic underlying the use of a "scree plot" to determine the number of factors to retain in Factor Analysis; see also Reviewing the Results of a Principal Components Analysis).

As has been seen, minimal cost-complexity cross-validation pruning and subsequent "right-sized" tree selection is a truly "automatic" process. The algorithms make all the decisions leading to selection of the "right-sized" tree, except for, perhaps, specification of a value for the SE rule. One issue that arises with the use of such "automatic" procedures is how well the results replicate, where replication might involve the selection of trees of quite different sizes across replications, given the "automatic" selection process that is used. This is where global cross-validation can be very useful. As explained previously, in global cross-validation, the entire analysis is replicated a specified number of times (3 is the default) holding out a fraction of the cases to use as a test sample to cross-validate the selected classification tree. If the average of the costs for the test samples, called the global CV costs, exceeds the CV costs for the selected tree, or if the standard error of the global CV costs exceeds the standard error of the CV costs for the selected tree, this indicates that the "automatic" tree selection procedure is allowing too much variability in tree selection rather than consistently selecting a tree with minimum estimated costs.

Classification trees and traditional methods. As can be seen in the methods used in computing classification trees, in a number of respects classification trees are decidedly different from traditional statistical methods for predicting class membership on a categorical dependent variable. They employ a hierarchy of predictions, with many predictions sometimes being applied to particular cases, to sort the cases into predicted classes. Traditional methods use simultaneous techniques to make one and only one class membership prediction for each and every case. In other respects, such as having as its goal accurate prediction, classification tree analysis is indistinguishable from traditional methods. Time will tell if classification tree analysis has enough to commend itself to become as accepted as the traditional methods.

For information on the basic purpose of classification trees, see Basic Ideas. For information on the hierarchical nature and flexibility of classification trees, see Characteristics of Classification Trees. See also, Exploratory Data Analysis and Data Mining Techniques.

## A Brief Comparison of Classification Tree Programs

A variety of classification tree programs have been developed to predict membership of cases or objects in the classes of a categorical dependent variable from their measurements on one or more predictor variables. In the previous section, Computational Methods, we have discussed the QUEST (Loh & Shih, 1997) and C&RT (Breiman et. al., 1984) programs for computing binary classification trees based on univariate splits for categorical predictor variables, ordered predictor variables (measured on at least an ordinal scale), or a mix of both types of predictors. We have also discussed computing classification trees based on linear combination splits for interval scale predictor variables.

Some classification trees programs, such as FACT (Loh & Vanichestakul, 1988) and THAID (Morgan & Messenger, 1973, as well as the related programs AID, for Automatic Interaction Detection, Morgan & Sonquist, 1963, and CHAID, for Chi-Square Automatic Interaction Detection, Kass, 1980) perform multi-level splits rather than binary splits when computing classification trees. A multi-level split performs k - 1 splits (where k is the number of levels of the splitting variable), as compared to a binary split that performs one split (regardless of the number of levels of the splitting variable). However, it should be noted that there is no inherent advantage of multi-level splits, because any multi-level split can be represented as a series of binary splits, and there may be disadvantages of using multi-level splits. With multi-level splits, predictor variables can be used for splitting only once, so the resulting classification trees may be unrealistically short and uninteresting (Loh & Shih, 1997). A more serious problem is bias in variable selection for splits. This bias is possible in any program such as THAID (Morgan & Sonquist, 1963) that employs an exhaustive search for finding splits (for a discussion, see Loh & Shih, 1997). Bias in variable selection is the bias toward selecting variables with more levels for splits, a bias that can skew the interpretation of the relative importance of the predictors in explaining responses on the dependent variable (Breiman et. al., 1984).

Bias in variable selection can be avoided by using the Discriminant-based (univariate or linear combination) split options. These options make use of the algorithms in QUEST (Loh & Shih, 1997) to prevent bias in variable selection. The C&RT-style exhaustive search for univariate splits option is useful if one's goal is to find splits producing the best possible classification in the learning sample (but not necessarily in independent cross-validation samples). For reliable splits, as well as computational speed, the Discriminant-based split options are recommended. For information on techniques and issues in computing classification trees, see the Computational Methods section.

Building trees interactively. In contrast, another method for building trees that has proven popular in applied research and data exploration is based on experts' knowledge about the domain or area under investigation, and relies on interactive choices (for how to grow the tree) by such experts to arrive at "good" (valid) models for prediction or predictive classification. In other words, instead of building trees automatically, using sophisticated algorithms for choosing good predictors and splits (for growing the branches of the tree), a user may want to determine manually which variables to include in the tree, and how to split those variables to create the branches of the tree. This enables the user to experiment with different variables and scenarios, and ideally to derive a better understanding of the phenomenon under investigation by combining her or his expertise with the analytic capabilities and options for building the. In practice, it may often be most useful to combine the automatic methods for building trees with "educated guesses" and domain-specific expertise. We may want to grow some portions of the tree using automatic methods and refine and modify the tree based on our expertise. Another common situation where this type of combined automatic and interactive tree building is called for is when some variables that are chosen automatically for some splits are not easily observable because they cannot be measured reliably or economically (i.e., obtaining such measurements would be too expensive). For example, suppose the automatic analysis at some point selects a variable Income as a good predictor for the next split; however, we may not be able to obtain reliable data on income from the new sample to which we want to apply the results of the current analysis (e.g., for predicting some behavior of interest, such as whether the person will purchase something from our catalog). In this case, we may want to select a "surrogate" variable, i.e., a variable that we can observe easily and that is likely related or similar to variable Income (with respect to its predictive power; for example, a variable Number of years of education may be related to Income and have similar predictive power; while most people are reluctant to reveal their level of income, they are more likely to report their level of education, and hence, this latter variable is more easily measured).

[Select Rating]

# How To Group Objects Into Similar Categories, Cluster Analysis

## General Purpose

The term cluster analysis (first used by Tryon, 1939) encompasses a number of different algorithms and methods for grouping objects of similar kind into respective categories. A general question facing researchers in many areas of inquiry is how to organize observed data into meaningful structures, that is, to develop taxonomies. In other words cluster analysis is an exploratory data analysis tool which aims at sorting different objects into groups in a way that the degree of association between two objects is maximal if they belong to the same group and minimal otherwise. Given the above, cluster analysis can be used to discover structures in data without providing an explanation/interpretation. In other words, cluster analysis simply discovers structures in data without explaining why they exist.

We deal with clustering in almost every aspect of daily life. For example, a group of diners sharing the same table in a restaurant may be regarded as a cluster of people. In food stores items of similar nature, such as different types of meat or vegetables are displayed in the same or nearby locations. There is a countless number of examples in which clustering plays an important role. For instance, biologists have to organize the different species of animals before a meaningful description of the differences between animals is possible. According to the modern system employed in biology, man belongs to the primates, the mammals, the amniotes, the vertebrates, and the animals. Note how in this classification, the higher the level of aggregation the less similar are the members in the respective class. Man has more in common with all other primates (e.g., apes) than it does with the more "distant" members of the mammals (e.g., dogs), etc. For a review of the general categories of cluster analysis methods, see Joining (Tree Clustering), Two-way Joining (Block Clustering), and k-Means Clustering. In short, whatever the nature of your business is, sooner or later you will run into a clustering problem of one form or another.

## Statistical Significance Testing

Note that the above discussions refer to clustering algorithms and do not mention anything about statistical significance testing. In fact, cluster analysis is not as much a typical statistical test as it is a "collection" of different algorithms that "put objects into clusters according to well defined similarity rules." The point here is that, unlike many other statistical procedures, cluster analysis methods are mostly used when we do not have any a priori hypotheses, but are still in the exploratory phase of our research. In a sense, cluster analysis finds the "most significant solution possible." Therefore, statistical significance testing is really not appropriate here, even in cases when p-levels are reported (as in k-means clustering).

## Area of Application

Clustering techniques have been applied to a wide variety of research problems. Hartigan (1975) provides an excellent summary of the many published studies reporting the results of cluster analyses. For example, in the field of medicine, clustering diseases, cures for diseases, or symptoms of diseases can lead to very useful taxonomies. In the field of psychiatry, the correct diagnosis of clusters of symptoms such as paranoia, schizophrenia, etc. is essential for successful therapy. In archeology, researchers have attempted to establish taxonomies of stone tools, funeral objects, etc. by applying cluster analytic techniques. In general, whenever we need to classify a "mountain" of information into manageable meaningful piles, cluster analysis is of great utility.

## Joining (Tree Clustering)

### General Logic

The example in the General Purpose Introduction illustrates the goal of the joining or tree clustering algorithm. The purpose of this algorithm is to join together objects (e.g., animals) into successively larger clusters, using some measure of similarity or distance. A typical result of this type of clustering is the hierarchical tree.

### Hierarchical Tree

Consider a Horizontal Hierarchical Tree Plot (see graph below), on the left of the plot, we begin with each object in a class by itself. Now imagine that, in very small steps, we "relax" our criterion as to what is and is not unique. Put another way, we lower our threshold regarding the decision when to declare two or more objects to be members of the same cluster.

As a result we link more and more objects together and aggregate (amalgamate) larger and larger clusters of increasingly dissimilar elements. Finally, in the last step, all objects are joined together. In these plots, the horizontal axis denotes the linkage distance (in Vertical Icicle Plots, the vertical axis denotes the linkage distance). Thus, for each node in the graph (where a new cluster is formed) we can read off the criterion distance at which the respective elements were linked together into a new single cluster. When the data contain a clear "structure" in terms of clusters of objects that are similar to each other, then this structure will often be reflected in the hierarchical tree as distinct branches. As the result of a successful analysis with the joining method, we are able to detect clusters (branches) and interpret those branches.

### Distance Measures

The joining or tree clustering method uses the dissimilarities (similarities) or distances between objects when forming the clusters. Similarities are a set of rules that serve as criteria for grouping or separating items. In the previous example the rule for grouping a number of dinners was whether they shared the same table or not. These distances (similarities) can be based on a single dimension or multiple dimensions, with each dimension representing a rule or condition for grouping objects. For example, if we were to cluster fast foods, we could take into account the number of calories they contain, their price, subjective ratings of taste, etc. The most straightforward way of computing distances between objects in a multi-dimensional space is to compute Euclidean distances. If we had a two- or three-dimensional space this measure is the actual geometric distance between objects in the space (i.e., as if measured with a ruler). However, the joining algorithm does not "care" whether the distances that are "fed" to it are actual real distances, or some other derived measure of distance that is more meaningful to the researcher; and it is up to the researcher to select the right method for his/her specific application.

Euclidean distance. This is probably the most commonly chosen type of distance. It simply is the geometric distance in the multidimensional space. It is computed as:

distance(x,y) = {i (xi - yi)2 }½

Note that Euclidean (and squared Euclidean) distances are usually computed from raw data, and not from standardized data. This method has certain advantages (e.g., the distance between any two objects is not affected by the addition of new objects to the analysis, which may be outliers). However, the distances can be greatly affected by differences in scale among the dimensions from which the distances are computed. For example, if one of the dimensions denotes a measured length in centimeters, and you then convert it to millimeters (by multiplying the values by 10), the resulting Euclidean or squared Euclidean distances (computed from multiple dimensions) can be greatly affected (i.e., biased by those dimensions which have a larger scale), and consequently, the results of cluster analyses may be very different. Generally, it is good practice to transform the dimensions so they have similar scales.

Squared Euclidean distance. You may want to square the standard Euclidean distance in order to place progressively greater weight on objects that are further apart. This distance is computed as (see also the note in the previous paragraph):

distance(x,y) = i (xi - yi)2

City-block (Manhattan) distance. This distance is simply the average difference across dimensions. In most cases, this distance measure yields results similar to the simple Euclidean distance. However, note that in this measure, the effect of single large differences (outliers) is dampened (since they are not squared). The city-block distance is computed as:

distance(x,y) = i |xi - yi|

Chebychev distance. This distance measure may be appropriate in cases when we want to define two objects as "different" if they are different on any one of the dimensions. The Chebychev distance is computed as:

distance(x,y) = Maximum|xi - yi|

Power distance. Sometimes we may want to increase or decrease the progressive weight that is placed on dimensions on which the respective objects are very different. This can be accomplished via the power distance. The power distance is computed as:

distance(x,y) = (i |xi - yi|p)1/r

where r and p are user-defined parameters. A few example calculations may demonstrate how this measure "behaves." Parameter p controls the progressive weight that is placed on differences on individual dimensions, parameter r controls the progressive weight that is placed on larger differences between objects. If r and p are equal to 2, then this distance is equal to the Euclidean distance.

Percent disagreement. This measure is particularly useful if the data for the dimensions included in the analysis are categorical in nature. This distance is computed as:

distance(x,y) = (Number of xi yi)/ i

At the first step, when each object represents its own cluster, the distances between those objects are defined by the chosen distance measure. However, once several objects have been linked together, how do we determine the distances between those new clusters? In other words, we need a linkage or amalgamation rule to determine when two clusters are sufficiently similar to be linked together. There are various possibilities: for example, we could link two clusters together when any two objects in the two clusters are closer together than the respective linkage distance. Put another way, we use the "nearest neighbors" across clusters to determine the distances between clusters; this method is called single linkage. This rule produces "stringy" types of clusters, that is, clusters "chained together" by only single objects that happen to be close together. Alternatively, we may use the neighbors across clusters that are furthest away from each other; this method is called complete linkage. There are numerous other linkage rules such as these that have been proposed.

Single linkage (nearest neighbor). As described above, in this method the distance between two clusters is determined by the distance of the two closest objects (nearest neighbors) in the different clusters. This rule will, in a sense, string objects together to form clusters, and the resulting clusters tend to represent long "chains."

Complete linkage (furthest neighbor). In this method, the distances between clusters are determined by the greatest distance between any two objects in the different clusters (i.e., by the "furthest neighbors"). This method usually performs quite well in cases when the objects actually form naturally distinct "clumps." If the clusters tend to be somehow elongated or of a "chain" type nature, then this method is inappropriate.

Unweighted pair-group average. In this method, the distance between two clusters is calculated as the average distance between all pairs of objects in the two different clusters. This method is also very efficient when the objects form natural distinct "clumps," however, it performs equally well with elongated, "chain" type clusters. Note that in their book, Sneath and Sokal (1973) introduced the abbreviation UPGMA to refer to this method as unweighted pair-group method using arithmetic averages.

Weighted pair-group average. This method is identical to the unweighted pair-group average method, except that in the computations, the size of the respective clusters (i.e., the number of objects contained in them) is used as a weight. Thus, this method (rather than the previous method) should be used when the cluster sizes are suspected to be greatly uneven. Note that in their book, Sneath and Sokal (1973) introduced the abbreviation WPGMA to refer to this method as weighted pair-group method using arithmetic averages.

Unweighted pair-group centroid. The centroid of a cluster is the average point in the multidimensional space defined by the dimensions. In a sense, it is the center of gravity for the respective cluster. In this method, the distance between two clusters is determined as the difference between centroids. Sneath and Sokal (1973) use the abbreviation UPGMC to refer to this method as unweighted pair-group method using the centroid average.

Weighted pair-group centroid (median). This method is identical to the previous one, except that weighting is introduced into the computations to take into consideration differences in cluster sizes (i.e., the number of objects contained in them). Thus, when there are (or we suspect there to be) considerable differences in cluster sizes, this method is preferable to the previous one. Sneath and Sokal (1973) use the abbreviation WPGMC to refer to this method as weighted pair-group method using the centroid average.

Ward's method. This method is distinct from all other methods because it uses an analysis of variance approach to evaluate the distances between clusters. In short, this method attempts to minimize the Sum of Squares (SS) of any two (hypothetical) clusters that can be formed at each step. Refer to Ward (1963) for details concerning this method. In general, this method is regarded as very efficient, however, it tends to create clusters of small size.

For an overview of the other two methods of clustering, see Two-way Joining and k-Means Clustering.

## Two-Way Joining

### Introductory Overview

Previously, we have discussed this method in terms of "objects" that are to be clustered (see Joining (Tree Clustering)). In all other types of analyses the research question of interest is usually expressed in terms of cases (observations) or variables. It turns out that the clustering of both may yield useful results. For example, imagine a study where a medical researcher has gathered data on different measures of physical fitness (variables) for a sample of heart patients (cases). The researcher may want to cluster cases (patients) to detect clusters of patients with similar syndromes. At the same time, the researcher may want to cluster variables (fitness measures) to detect clusters of measures that appear to tap similar physical abilities.

### Two-Way Joining

Given the discussion in the paragraph above concerning whether to cluster cases or variables, we may wonder why not cluster both simultaneously? Two-way joining is useful in (the relatively rare) circumstances when we expect that both cases and variables will simultaneously contribute to the uncovering of meaningful patterns of clusters.

For example, returning to the example above, the medical researcher may want to identify clusters of patients that are similar with regard to particular clusters of similar measures of physical fitness. The difficulty with interpreting these results may arise from the fact that the similarities between different clusters may pertain to (or be caused by) somewhat different subsets of variables. Thus, the resulting structure (clusters) is by nature not homogeneous. This may seem a bit confusing at first, and, indeed, compared to the other clustering methods described (see Joining (Tree Clustering) and k-Means Clustering), two-way joining is probably the one least commonly used. However, some researchers believe that this method offers a powerful exploratory data analysis tool (for more information you may want to refer to the detailed description of this method in Hartigan, 1975).

## k-Means Clustering

### General Logic

This method of clustering is very different from the Joining (Tree Clustering) and Two-way Joining. Suppose that you already have hypotheses concerning the number of clusters in your cases or variables. You may want to "tell" the computer to form exactly 3 clusters that are to be as distinct as possible. This is the type of research question that can be addressed by the k- means clustering algorithm. In general, the k-means method will produce exactly k different clusters of greatest possible distinction. It should be mentioned that the best number of clusters k leading to the greatest separation (distance) is not known as a priori and must be computed from the data (see Finding the Right Number of Clusters).

### Example

In the physical fitness example (see Two-way Joining), the medical researchers may have a "hunch" from clinical experience that their heart patients fall basically into three different categories with regard to physical fitness. They might wonder whether this intuition can be quantified, that is, whether a k-means cluster analysis of the physical fitness measures would indeed produce the three clusters of patients as expected. If so, the means on the different measures of physical fitness for each cluster would represent a quantitative way of expressing the researchers' hypothesis or intuition (i.e., patients in cluster 1 are high on measure 1, low on measure 2, etc.).

### Computations

Computationally, you may think of this method as analysis of variance (ANOVA) "in reverse." The program will start with k random clusters, and then move objects between those clusters with the goal to 1) minimize variability within clusters and 2) maximize variability between clusters. In other words, the similarity rules will apply maximally to the members of one cluster and minimally to members belonging to the rest of the clusters. This is analogous to "ANOVA in reverse" in the sense that the significance test in ANOVA evaluates the between group variability against the within-group variability when computing the significance test for the hypothesis that the means in the groups are different from each other. In k-means clustering, the program tries to move objects (e.g., cases) in and out of groups (clusters) to get the most significant ANOVA results.

### Interpretation of Results

Usually, as the result of a k-means clustering analysis, we would examine the means for each cluster on each dimension to assess how distinct our k clusters are. Ideally, we would obtain very different means for most, if not all dimensions, used in the analysis. The magnitude of the F values from the analysis of variance performed on each dimension is another indication of how well the respective dimension discriminates between clusters.

## EM (Expectation Maximization) Clustering

### Introductory Overview

The methods described here are similar to the k-Means algorithm described above, and you may want to review that section for a general overview of these techniques and their applications. The general purpose of these techniques is to detect clusters in observations (or variables) and to assign those observations to the clusters. A typical example application for this type of analysis is a marketing research study in which a number of consumer behavior related variables are measured for a large sample of respondents. The purpose of the study is to detect "market segments," i.e., groups of respondents that are somehow more similar to each other (to all other members of the same cluster) when compared to respondents that "belong to" other clusters. In addition to identifying such clusters, it is usually equally of interest to determine how the clusters are different, i.e., determine the specific variables or dimensions that vary and how they vary in regard to members in different clusters.

k-means clustering. To reiterate, the classic k-Means algorithm was popularized and refined by Hartigan (1975; see also Hartigan and Wong, 1978). The basic operation of that algorithm is relatively simple: Given a fixed number of (desired or hypothesized) k clusters, assign observations to those clusters so that the means across clusters (for all variables) are as different from each other as possible.

Extensions and generalizations. The EM (expectation maximization) algorithm extends this basic approach to clustering in two important ways:

1. Instead of assigning cases or observations to clusters to maximize the differences in means for continuous variables, the EM clustering algorithm computes probabilities of cluster memberships based on one or more probability distributions. The goal of the clustering algorithm then is to maximize the overall probability or likelihood of the data, given the (final) clusters.
2. Unlike the classic implementation of k-means clustering, the general EM algorithm can be applied to both continuous and categorical variables (note that the classic k-means algorithm can also be modified to accommodate categorical variables).

### The EM Algorithm

The EM algorithm for clustering is described in detail in Witten and Frank (2001). The basic approach and logic of this clustering method is as follows. Suppose you measure a single continuous variable in a large sample of observations. Further, suppose that the sample consists of two clusters of observations with different means (and perhaps different standard deviations); within each sample, the distribution of values for the continuous variable follows the normal distribution. The resulting distribution of values (in the population) may look like this:

Mixtures of distributions. The illustration shows two normal distributions with different means and different standard deviations, and the sum of the two distributions. Only the mixture (sum) of the two normal distributions (with different means and standard deviations) would be observed. The goal of EM clustering is to estimate the means and standard deviations for each cluster so as to maximize the likelihood of the observed data (distribution). Put another way, the EM algorithm attempts to approximate the observed distributions of values based on mixtures of different distributions in different clusters.

With the implementation of the EM algorithm in some computer programs, you may be able to select (for continuous variables) different distributions such as the normal, log-normal, and Poisson distributions. You can select different distributions for different variables and, thus, derive clusters for mixtures of different types of distributions.

Categorical variables. The EM algorithm can also accommodate categorical variables. The method will at first randomly assign different probabilities (weights, to be precise) to each class or category, for each cluster. In successive iterations, these probabilities are refined (adjusted) to maximize the likelihood of the data given the specified number of clusters.

Classification probabilities instead of classifications. The results of EM clustering are different from those computed by k-means clustering. The latter will assign observations to clusters to maximize the distances between clusters. The EM algorithm does not compute actual assignments of observations to clusters, but classification probabilities. In other words, each observation belongs to each cluster with a certain probability. Of course, as a final result you can usually review an actual assignment of observations to clusters, based on the (largest) classification probability.

## Finding the Right Number of Clusters in k-Means and EM Clustering: v-Fold Cross-Validation

An important question that needs to be answered before applying the k-means or EM clustering algorithms is how many clusters there are in the data. This is not known a priori and, in fact, there might be no definite or unique answer as to what value k should take. In other words, k is a nuisance parameter of the clustering model. Luckily, an estimate of k can be obtained from the data using the method of cross-validation. Remember that the k-means and EM methods will determine cluster solutions for a particular user-defined number of clusters. The k-means and EM clustering techniques (described above) can be optimized and enhanced for typical applications in data mining. The general metaphor of data mining implies the situation in which an analyst searches for useful structures and "nuggets" in the data, usually without any strong a priori expectations of what the analyst might find (in contrast to the hypothesis-testing approach of scientific research). In practice, the analyst usually does not know ahead of time how many clusters there might be in the sample. For that reason, some programs include an implementation of a v-fold cross-validation algorithm for automatically determining the number of clusters in the data.

This unique algorithm is immensely useful in all general "pattern-recognition" tasks - to determine the number of market segments in a marketing research study, the number of distinct spending patterns in studies of consumer behavior, the number of clusters of different medical symptoms, the number of different types (clusters) of documents in text mining, the number of weather patterns in meteorological research, the number of defect patterns on silicon wafers, etc.

The v-fold cross-validation algorithm applied to clustering. The v-fold cross-validation algorithm is described in some detail in Classification Trees and General Classification and Regression Trees (GC&RT). The general idea of this method is to divide the overall sample into a number of v folds. The same type of analysis is then successively applied to the observations belonging to the v-1 folds (training sample), and the results of the analyses are applied to sample v (the sample or fold that was not used to estimate the parameters, build the tree, determine the clusters, etc.; this is the testing sample) to compute some index of predictive validity. The results for the v replications are aggregated (averaged) to yield a single measure of the stability of the respective model, i.e., the validity of the model for predicting new observations.

Cluster analysis is an unsupervised learning technique, and we cannot observe the (real) number of clusters in the data. However, it is reasonable to replace the usual notion (applicable to supervised learning) of "accuracy" with that of "distance." In general, we can apply the v-fold cross-validation method to a range of numbers of clusters in k-means or EM clustering, and observe the resulting average distance of the observations (in the cross-validation or testing samples) from their cluster centers (for k-means clustering); for EM clustering, an appropriate equivalent measure would be the average negative (log-) likelihood computed for the observations in the testing samples.

Reviewing the results of v-fold cross-validation. The results of v-fold cross-validation are best reviewed in a simple line graph.

Shown here is the result of analyzing a data set widely known to contain three clusters of observations (specifically, the well-known Iris data file reported by Fisher, 1936, and widely referenced in the literature on discriminant function analysis). Also shown (in the graph to the right) are the results for analyzing simple normal random numbers. The "real" data (shown to the left) exhibit the characteristic scree-plot pattern (see also Factor Analysis), where the cost function (in this case, 2 times the log-likelihood of the cross-validation data, given the estimated parameters) quickly decreases as the number of clusters increases, but then (past 3 clusters) levels off, and even increases as the data are overfitted. Alternatively, the random numbers show no such pattern, in fact, there is basically no decrease in the cost function at all, and it quickly begins to increase as the number of clusters increases and overfitting occurs.

It is easy to see from this simple illustration how useful the v-fold cross-validation technique, applied to k-means and EM clustering can be for determining the "right" number of clusters in the data.

[Select Rating]

# How To Analyze Simple Two-Way and Multi-Way Table, Correspondence Analysis

## General Purpose

Correspondence analysis is a descriptive/exploratory technique designed to analyze simple two-way and multi-way tables containing some measure of correspondence between the rows and columns. The results provide information which is similar in nature to those produced by Factor Analysis techniques, and they allow you to explore the structure of categorical variables included in the table. The most common kind of table of this type is the two-way frequency crosstabulation table (see, for example, Basic Statistics or Log-Linear).

In a typical correspondence analysis, a crosstabulation table of frequencies is first standardized, so that the relative frequencies across all cells sum to 1.0. One way to state the goal of a typical analysis is to represent the entries in the table of relative frequencies in terms of the distances between individual rows and/or columns in a low-dimensional space. This is best illustrated by a simple example, which will be described below. There are several parallels in interpretation between correspondence analysis and Factor Analysis, and some similar concepts will also be pointed out below.

For a comprehensive description of this method, computational details, and its applications (in the English language), refer to the classic text by Greenacre (1984). These methods were originally developed primarily in France by Jean-Paul Benzérci in the early 1960's and 1970's (e.g., see Benzérci, 1973; see also Lebart, Morineau, and Tabard, 1977), but have only more recently gained increasing popularity in English-speaking countries (see, for example, Carrol, Green, and Schaffer, 1986; Hoffman and Franke, 1986). (Note that similar techniques were developed independently in several countries, where they were known as optimal scaling, reciprocal averaging, optimal scoring, quantification method, or homogeneity analysis). In the following paragraphs, a general introduction to correspondence analysis will be presented.

Overview. Suppose you collected data on the smoking habits of different employees in a company. The following data set is presented in Greenacre (1984, p. 55).

Smoking Category
Staff
Group
(1)
None
(2)
Light
(3)
Medium
(4)
Heavy
Row
Totals
(1) Senior Managers
(2) Junior Managers
(3) Senior Employees
(4) Junior Employees
(5) Secretaries
4
4
25
18
10
2
3
10
24
6
3
7
12
33
7
2
4
4
13
2
11
18
51
88
25
Column Totals 61 45 62 25 193

You can think of the 4 column values in each row of the table as coordinates in a 4-dimensional space, and you could compute the (Euclidean) distances between the 5 row points in the 4- dimensional space. The distances between the points in the 4-dimensional space summarize all information about the similarities between the rows in the table above. Now suppose you could find a lower-dimensional space, in which to position the row points in a manner that retains all, or almost all, of the information about the differences between the rows. You could then present all information about the similarities between the rows (types of employees in this case) in a simple 1, 2, or 3-dimensional graph. While this may not appear to be particularly useful for small tables such as the one shown above, you can easily imagine how the presentation and interpretation of very large tables (e.g., differential preference for 10 consumer items among 100 groups of respondents in a consumer survey) could greatly benefit from the simplification that can be achieved via correspondence analysis (e.g., represent the 10 consumer items in a two- dimensional space).

Mass. To continue with the simpler example of the two-way table presented above, computationally, the program will first compute the relative frequencies for the frequency table, so that the sum of all table entries is equal to 1.0 (each element will be divided by the total, i.e., 193). You could say that this table now shows how one unit of mass is distributed across the cells. In the terminology of correspondence analysis, the row and column totals of the matrix of relative frequencies are called the row mass and column mass, respectively.

Inertia. The term inertia in correspondence analysis is used by analogy with the definition in applied mathematics of "moment of inertia," which stands for the integral of mass times the squared distance to the centroid (e.g., Greenacre, 1984, p. 35). Inertia is defined as the total Pearson Chi-square for the two-way divided by the total sum (193 in the present example)..

Inertia and row and column profiles. If the rows and columns in a table are completely independent of each other, the entries in the table (distribution of mass) can be reproduced from the row and column totals alone, or row and column profiles in the terminology of correspondence analysis. According to the well-known formula for computing the Chi-square statistic for two-way tables, the expected frequencies in a table, where the column and rows are independent of each other, are equal to the respective column total times the row total, divided by the grand total. Any deviations from the expected values (expected under the hypothesis of complete independence of the row and column variables) will contribute to the overall Chi-square. Thus, another way of looking at correspondence analysis is to consider it a method for decomposing the overall Chi-square statistic (or Inertia=Chi- square/Total N) by identifying a small number of dimensions in which the deviations from the expected values can be represented. This is similar to the goal of Factor Analysis, where the total variance is decomposed, so as to arrive at a lower-dimensional representation of the variables that allows you to reconstruct most of the variance/covariance matrix of variables.

Analyzing rows and columns. This simple example began with a discussion of the row-points in the table shown above. However, you may rather be interested in the column totals, in which case you could plot the column points in a small-dimensional space, which satisfactorily reproduces the similarity (and distances) between the relative frequencies for the columns, across the rows, in the table shown above. In fact it is customary to simultaneously plot the column points and the row points in a single graph, to summarize the information contained in a two-way table.

Reviewing results. Let's now look at some of the results for the table shown above. First, shown below are the so-called singular values , eigenvalues, percentages of inertia explained, cumulative percentages, and the contribution to the overall Chi- square.

Eigenvalues and Inertia for all Dimensions
Input Table (Rows x Columns):  5 x 4
Total Inertia = .08519 Chi² = 16.442
No. of
Dims
Singular
Values
Eigen-
Values
Perc. of
Inertia
Cumulatv
Percent
Chi
Squares
1
2
3
.273421
.100086
.020337
.074759
.010017
.000414
87.75587
11.75865
.48547
87.7559
99.5145
100.0000
14.42851
1.93332
.07982

Note that the dimensions are "extracted" so as to maximize the distances between the row or column points, and successive dimensions (which are independent of or orthogonal to each other) will "explain" less and less of the overall Chi-square value (and, thus, inertia). Thus, the extraction of the dimensions is similar to the extraction of principal components in Factor Analysis.

First, it appears that, with a single dimension, 87.76% of the inertia can be "explained," that is, the relative frequency values that can be reconstructed from a single dimension can reproduce 87.76% of the total Chi-square value (and, thus, of the inertia) for this two-way table; two dimensions allow you to explain 99.51%.

Maximum number of dimensions. Since the sums of the frequencies across the columns must be equal to the row totals, and the sums across the rows equal to the column totals, there are in a sense only (no. of columns-1) independent entries in each row, and (no. of rows-1) independent entries in each column of the table (once you know what these entries are, you can fill in the rest based on your knowledge of the column and row marginal totals). Thus, the maximum number of eigenvalues that can be extracted from a two- way table is equal to the minimum of the number of columns minus 1, and the number of rows minus 1. If you choose to extract (i.e., interpret) the maximum number of dimensions that can be extracted, then you can reproduce exactly all information contained in the table.

Row and column coordinates. Next look at the coordinates for the two-dimensional solution.

Row Name Dim. 1 Dim. 2
(1) Senior Managers
(2) Junior Managers
(3) Senior Employees
(4) Junior Employees
(5) Secretaries
-.065768
.258958
-.380595
.232952
-.201089
.193737
.243305
.010660
-.057744
-.078911

Of course, you can plot these coordinates in a two-dimensional scatterplot. Remember that the purpose of correspondence analysis is to reproduce the distances between the row and/or column points in a two-way table in a lower-dimensional display; note that, as in Factor Analysis, the actual rotational orientation of the axes is arbitrarily chosen so that successive dimensions "explain" less and less of the overall Chi-square value (or inertia). You could, for example, reverse the signs in each column in the table shown above, thereby effectively rotating the respective axis in the plot by 180 degrees.

What is important are the distances of the points in the two-dimensional display, which are informative in that row points that are close to each other are similar with regard to the pattern of relative frequencies across the columns. If you have produced this plot you will see that, along the most important first axis in the plot, the Senior employees and Secretaries are relatively close together on the left side of the origin (scale position 0). If you looked at the table of relative row frequencies (i.e., frequencies standardized, so that their sum in each row is equal to 100%), you will see that these two groups of employees indeed show very similar patterns of relative frequencies across the categories of smoking intensity.

Percentages of Row Totals
Smoking Category
Staff
Group
(1)
None
(2)
Light
(3)
Medium
(4)
Heavy
Row
Totals
(1) Senior Managers
(2) Junior Managers
(3) Senior Employees
(4) Junior Employees
(5) Secretaries
36.36
22.22
49.02
20.45
40.00
18.18
16.67
19.61
27.27
24.00
27.27
38.89
23.53
37.50
28.00
18.18
22.22
7.84
14.77
8.00
100.00
100.00
100.00
100.00
100.00

Obviously the final goal of correspondence analysis is to find theoretical interpretations (i.e., meaning) for the extracted dimensions. One method that may aid in interpreting extracted dimensions is to plot the column points. Shown below are the column coordinates for the first and second dimension.

Smoking
category

Dim. 1

Dim. 2
None
Light
Medium
Heavy
-.393308
.099456
.196321
.293776
.030492
-.141064
-.007359
.197766

It appears that the first dimension distinguishes mostly between the different degrees of smoking, and in particular between category None and the others. Thus, you can interpret the greater similarity of Senior Managers with Secretaries, with regard to their position on the first axis, as mostly deriving from the relatively large numbers of None smokers in these two groups of employees.

Compatibility of row and column coordinates. It is customary to summarize the row and column coordinates in a single plot. However, it is important to remember that in such plots, you can only interpret the distances between row points, and the distances between column points, but not the distances between row points and column points.

To continue with this example, it would not be appropriate to say that the category None is similar to Senior Employees (the two points are very close in the simultaneous plot of row and column coordinates). However, as was indicated earlier, it is appropriate to make general statements about the nature of the dimensions, based on which side of the origin particular points fall. For example, because category None is the only column point on the left side of the origin for the first axis, and since employee group Senior Employees also falls onto that side of the first axis, you may conclude that the first axis separates None smokers from the other categories of smokers, and that Senior Employees are different from, for example, Junior Employees, in that there are relatively more non-smoking Senior Employees.

Scaling of the coordinates (standardization options). Another important decision that the analyst must make concerns the scaling of the coordinates. The nature of the choice pertains to whether or not you want to analyze the relative row percentages, column percentages, or both. In the context of the example described above, the row percentages were shown to illustrate how the patterns of those percentages across the columns are similar for points which appear more closely together in the graphical display of the row coordinates. Put another way, the coordinates are based on the analysis of the row profile matrix, where the sum of the table entries in a row, across all columns, is equal to 1.0 (each entry rij in the row profile matrix can be interpreted as the conditional probability that a case belongs to column j, given its membership in row i). Thus, the coordinates are computed so as to maximize the differences between the points with respect to the row profiles (row percentages). The row coordinates are computed from the row profile matrix, the column coordinates are computed from the column profile matrix.

A fourth option, Canonical standardization (see Gifi, 1981), is also provided, and it amounts to a standardization of the columns and rows of the matrix of relative frequencies. This standardization amounts to a rescaling of the coordinates based on the row profile standardization and the column profile standardization, and this type of standardization is not widely used. Note also that a variety of other custom standardizations can be easily performed if you have the raw eigenvalues and eigenvector matrices.

Metric of coordinate system. In several places in this introduction, the term distance was (loosely) used to refer to the differences between the pattern of relative frequencies for the rows across the columns, and columns across the rows, which are to be reproduced in a lower-dimensional solution as a result of the correspondence analysis. Actually, these distances represented by the coordinates in the respective space are not simple Euclidean distances computed from the relative row or column frequencies, but rather, they are weighted distances. Specifically, the weighting that is applied is such that the metric in the lower- dimensional space is a Chi-square metric, provided that (1) you are comparing row points, and chose either row-profile standardization or both row- and column-profile standardization, or (2) you are comparing column points, and chose either column-profile standardization or both row- and column-profile standardization.

In that case (but not if you chose the canonical standardization), the squared Euclidean distance between, for example, two row points i and i' in the respective coordinate system of a given number of dimensions actually approximates a weighted (i.e., Chi-square) distance between the relative frequencies (see Hoffman and Franke, 1986, formula 21):

dii '2 = j (1/cj (pij /ri - p2i ' j /ri '))

In this formula, dii ' stands for the squared distance between the two points, cj stands for the column total for the j'th column of the standardized frequency table (where the sum of all entries or mass is equal to 1.0), pij stands for the individual cell entries in the standardized frequency table (row i, column j), ri stands for the row total for the i'th column of the relative frequency table, and the summation is over the columns of the table. To reiterate, only the distances between row points, and correspondingly, between column points are interpretable in this manner; the distances between row points and column points cannot be interpreted.

Judging the quality of a solution. A number of auxiliary statistics are reported, to aid in the evaluation of the quality of the respective chosen numbers of dimensions. The general concern here is that all (or at least most) points are properly represented by the respective solution, that is, that their distances to other points can be approximated to a satisfactory degree. Shown below are all statistics reported for the row coordinates for the example table discussed so far, based on a one-dimensional solution only (i.e., only one dimension is used to reconstruct the patterns of relative frequencies across the columns).

Row Coordinates and Contributions to Inertia

Staff Group
Coordin.
Dim.1

Mass

Quality
Relative
Inertia
Inertia
Dim.1
Cosine²
Dim.1
(1) Senior Managers
(2) Junior Managers
(3) Senior Employees
(4) Junior Employees
(5) Secretaries
-.065768
.258958
-.380595
.232952
-.201089
.056995
.093264
.264249
.455959
.129534
.092232
.526400
.999033
.941934
.865346
.031376
.139467
.449750
.308354
.071053
.003298
.083659
.512006
.330974
.070064
.092232
.526400
.999033
.941934
.865346

Coordinates. The first numeric column shown in the table above contains the coordinates, as discussed in the previous paragraphs. To reiterate, the specific interpretation of these coordinates depends on the standardization chosen for the solution (see above). The number of dimensions is chosen by the user (in this case we chose only one dimension), and coordinate values will be shown for each dimension (i.e., there will be one column with coordinate values for each dimension).

Mass. The Mass column contains the row totals (since these are the row coordinates) for the table of relative frequencies (i.e., for the table where each entry is the respective mass, as discussed earlier in this section). Remember that the coordinates are computed based on the matrix of conditional probabilities shown in the Mass column.

Quality. The Quality column contains information concerning the quality of representation of the respective row point in the coordinate system defined by the respective numbers of dimensions, as chosen by the user. In the table shown above, only one dimension was chosen, and the numbers in the Quality column pertain to the quality of representation in the one-dimensional space. To reiterate, computationally, the goal of the correspondence analysis is to reproduce the distances between points in a low-dimensional space. If you extracted (i.e., interpreted) the maximum number of dimensions (which is equal to the minimum of the number of rows and the number of columns, minus 1), you could reconstruct all distances exactly. The Quality of a point is defined as the ratio of the squared distance of the point from the origin in the chosen number of dimensions, over the squared distance from the origin in the space defined by the maximum number of dimensions (remember that the metric here is Chi-square, as described earlier). By analogy to Factor Analysis, the quality of a point is similar in its interpretation to the communality for a variable in factor analysis. .

Note that the Quality measure reported is independent of the chosen method of standardization, and always pertains to the default standardization (i.e., the distance metric is Chi-square, and the quality measure can be interpreted as the "proportion of Chi- square accounted for" for the respective row, given the respective number of dimensions). A low quality means that the current number of dimensions does not well represent the respective row (or column). In the table shown above, the quality for the first row (Senior Managers) is less than .1, indicating that this row point is not well represented by the one- dimensional representation of the points.

Relative inertia. The Quality of a point (see above) represents the proportion of the contribution of that point to the overall inertia (Chi-square) that can be accounted for by the chosen number of dimensions. However, it does not indicate whether or not, and to what extent, the respective point does in fact contribute to the overall inertia (Chi- square value). The relative inertia represents the proportion of the total inertia accounted for by the respective point, and it is independent of the number of dimensions chosen by the user. Note that a particular solution may represent a point very well (high Quality), but the same point may not contribute much to the overall inertia (e.g., a row point with a pattern of relative frequencies across the columns that is similar to the average pattern across all rows).

Relative inertia for each dimension. This column contains the relative contribution of the respective (row) point to the inertia "accounted for" by the respective dimension. Thus, this value will be reported for each (row or column) point, for each dimension.

Cosine² (quality or squared correlations with each dimension). This column contains the quality for each point, by dimension. The sum of the values in these columns across the dimensions is equal to the total Quality value discussed above (since in the example table above, only one dimension was chose, the values in this column are identical to the values in the overall Quality column). This value may also be interpreted as the "correlation" of the respective point with the respective dimension. The term Cosine² refers to the fact that this value is also the squared cosine value of the angle the point makes with the respective dimension (refer to Greenacre, 1984, for details concerning the geometric aspects of correspondence analysis).

A note about "statistical significance." It should be noted at this point that correspondence analysis is an exploratory technique. Actually, the method was developed based on a philosophical orientation that emphasizes the development of models that fit the data, rather than the rejection of hypotheses based on the lack of fit (Benzecri's "second principle" states that "The model must fit the data, not vice versa;" see Greenacre, 1984, p. 10). Therefore, there are no statistical significance tests that are customarily applied to the results of a correspondence analysis; the primary purpose of the technique is to produce a simplified (low- dimensional) representation of the information in a large frequency table (or tables with similar measures of correspondence).

## Supplementary Points

The introductory section provides an overview of how to interpret the coordinates and related statistics computed in a correspondence analysis. An important aid in the interpretation of the results from a correspondence analysis is to include supplementary row or column points, that were not used to perform the original analyses. For example, consider the following results which are based on the example given in the introductory (based on Greenacre, 1984).

Row Name Dim. 1 Dim. 2
(1) Senior Managers
(2) Junior Managers
(3) Senior Employees
(4) Junior Employees
(5) Secretaries
-.065768
.258958
-.380595
.232952
-.201089
.193737
.243305
.010660
-.057744
-.078911
National Average -.258368 -.117648

The table above shows the coordinate values (for two dimensions) computed for a frequency table of different types of employees by type of smoking habit. The row labeled National Average contains the coordinate values for the supplementary point, which is the national average (percentages) for the different smoking categories (which make up the columns of the table; those fictitious percentages reported in Greenacre (1984) are: Nonsmokers: 42%, light smokers: 29%, medium smokers, 20%; heavy smokers: 9%). If you plotted these coordinates in a two-dimensional scatterplot, along with the column coordinates, it would be apparent that the National Average supplementary row point is plotted close to the point representing the Secretaries group, and on the same side of the horizontal axis (first dimension) as the Nonsmokers column point. If you refer back to the original two-way table shown in the introductory section, this finding is consistent with the entries in the table of row frequencies, that is, there are relatively more nonsmokers among the Secretaries, and in the National Average. Put another way, the sample represented in the original frequency table contains more smokers than the national average.

While this type of information could have been easily gleaned from the original frequency table (that was used as the input to the analysis), in the case of very large tables, such conclusions may not be as obvious.

Quality. Another interesting result for supplementary points concerns the quality of their representation in the chosen number of dimensions (see the introductory section for a more detailed discussion of the concept of quality). To reiterate, the goal of the correspondence analysis is to reproduce the distances between the row or column coordinates (patterns of relative frequencies across the columns or rows, respectively) in a low-dimensional solution. Given such a solution, you may ask whether particular supplementary points of interest can be represented equally well in the final space, that is, whether or not their distances from the other points in the table can also be represented in the chosen numbers of dimensions. Shown below are the summary statistics for the original points, and the supplementary row point National Average, for the two-dimensional solution of representation of representation of supplementary points.

Staff Group

Quality
Cosine²
Dim.1
Cosine²
Dim.2
(1) Senior Managers
(2) Junior Managers
(3) Senior Employees
(4) Junior Employees
(5) Secretaries
.892568
.991082
.999817
.999810
.998603
.092232
.526400
.999033
.941934
.865346
.800336
.464682
.000784
.057876
.133257
National Average .761324 .630578 .130746

The statistics reported in the table above are discussed in the introductory section. In short, the Quality of a row or column point is defined as the ratio of the squared distance of the point from the origin in the chosen number of dimensions, over the squared distance from the origin in the space defined by the maximum number of dimensions (remember that the metric here is Chi-square, as described in the introductory section). In a sense, the overall quality is the "proportion of squared distance-from-the-overall-centroid accounted for." The supplementary row point National Average has a quality of .76, indicating that it is reasonably well represented in the two-dimensional solution. The Cosine² statistic is the quality "accounted for" by the respective row point, by the respective dimension (the sum of the Cosine² values over the respective number of dimensions is equal to the total Quality, see also the introductory section).

## Multiple Correspondence Analysis (MCA)

Multiple correspondence analysis (MCA) may be considered to be an extension of simple correspondence analysis to more than two variables. For an introductory overview of simple correspondence analysis, refer to the introductory section . Multiple correspondence analysis is a simple correspondence analysis carried out on an indicator (or design) matrix with cases as rows and categories of variables as columns. Actually, one usually analyzes the inner product of such a matrix, called the Burt Table in an MCA; this will be discussed later. However, to clarify the interpretation of the results from a multiple correspondence analysis, it is easier to discuss the simple correspondence analysis of an indicator or design matrix.

Indicator or design matrix. Consider again the simple two-way table presented in the introductory section:

Smoking Category
Staff
Group
(1)
None
(2)
Light
(3)
Medium
(4)
Heavy
Row
Totals
(1) Senior Managers
(2) Junior Managers
(3) Senior Employees
(4) Junior Employees
(5) Secretaries
4
4
25
18
10
2
3
10
24
6
3
7
12
33
7
2
4
4
13
2
11
18
51
88
25
Column Totals 61 45 62 25 193

Suppose you had entered the data for this table in the following manner, as an indicator or design matrix:

Staff Group Smoking
Case
Number
Senior
Manager
Junior
Manager
Senior
Employee
Junior
Employee

Secretary

None

Light

Medium

Heavy
1
2
3
4
5
...
...
...
191
192
193
1
1
1
1
1
.
.
.
0
0
0
0
0
0
0
0
.
.
.
0
0
0
0
0
0
0
0
.
.
.
0
0
0
0
0
0
0
0
.
.
.
0
0
0
0
0
0
0
0
.
.
.
1
1
1
1
1
1
1
0
.
.
.
0
0
0
0
0
0
0
1
.
.
.
0
0
0
0
0
0
0
0
.
.
.
1
0
0
0
0
0
0
0
.
.
.
0
1
1

Each one of the 193 total cases in the table is represented by one case in this data file. For each case a 1 is entered into the category where the respective case "belongs," and a 0 otherwise. For example, case 1 represents a Senior Manager who is a None smoker. As can be seen in the table above, there are a total of 4 such cases in the two-way table, and thus there will be four cases like this in the indicator matrix. In all, there will be 193 cases in the indicator or design matrix.

Analyzing the design matrix. If you now analyzed this data file (design or indicator matrix) shown above as if it were a two-way frequency table, the results of the correspondence analysis would provide column coordinates that would allow you to relate the different categories to each other, based on the distances between the row points, i.e., between the individual cases. In fact, the two-dimensional display you would obtain for the column coordinates would look very similar to the combined display for row and column coordinates, if you had performed the simple correspondence analysis on the two-way frequency table (note that the metric will be different, but the relative positions of the points will be very similar).

More than two variables. The approach to analyzing categorical data outlined above can easily be extended to more than two categorical variables. For example, the indicator or design matrix could contain two additional variables Male and Female, again coded 0 and 1, to indicate the subjects' gender; and three variables could be added to indicate to which one of three age groups a case belongs. Thus, in the final display, you could represent the relationships (similarities) between Gender, Age, Smoking habits, and Occupation (Staff Groups).

Fuzzy coding. It is not necessary that each case is assigned exclusively to only one category of each categorical variable. Rather than the 0-or-1 coding scheme, you could enter probabilities for membership in a category, or some other measure that represents a fuzzy rule for group membership. Greenacre (1984) discusses different types of coding schemes of this kind. For example, suppose in the example design matrix shown earlier, you had missing data for a few cases regarding their smoking habits. Instead of discarding those cases entirely from the analysis (or creating a new category Missing data), you could assign to the different smoking categories proportions (which should add to 1.0) to represent the probabilities that the respective case belongs to the respective category (e.g., you could enter proportions based on your knowledge about estimates for the national averages for the different categories).

Interpretation of coordinates and other results. To reiterate, the results of a multiple correspondence analysis are identical to the results you would obtain for the column coordinates from a simple correspondence analysis of the design or indicator matrix. Therefore, the interpretation of coordinate values, quality values, cosine²'s and other statistics reported as the results from a multiple correspondence analysis can be interpreted in the same manner as described in the context of the simple correspondence analysis (see introductory section), however, these statistics pertain to the total inertia associated with the entire design matrix.

Supplementary column points and "multiple regression" for categorical variables. Another application of the analysis of design matrices via correspondence analysis techniques is that it allows you to perform the equivalent of a Multiple Regression for categorical variables, by adding supplementary columns to the design matrix. For example, suppose you added to the design matrix shown earlier two columns to indicate whether or not the respective subject had or had not been ill over the past year (i.e., you could add one column Ill and another column Not ill, and again enter 0's and 1's to indicate each subject's health status). If, in a simple correspondence analysis of the design matrix, you added those columns as supplementary columns to the analysis, then (1) the summary statistics for the quality of representation (see the introductory section) for those columns would give you an indication of how well you can "explain" illness as a function of the other variables in the design matrix, and (2) the display of the column points in the final coordinate system would provide an indication of the nature (e.g., direction) of the relationships between the columns in the design matrix and the column points indicating illness; this technique (adding supplementary points to an MCA analysis) is also sometimes called predictive mapping.

The Burt table. The actual computations in multiple correspondence analysis are not performed on a design or indicator matrix (which, potentially, may be very large if there are many cases), but on the inner product of this matrix; this matrix is also called the Burt matrix. With frequency tables, this amounts to tabulating the stacked categories against each other; for example the Burt for the two-way frequency table presented earlier would look like this.

Employee Smoking
(1) (2) (3) (4) (5) (1) (2) (3) (4)
(1) Senior Managers
(2) Junior Managers
(3) Senior Employees
(4) Junior Employees
(5) Secretaries
(1) Smoking:None
(2) Smoking:Light
(3) Smoking:Medium
(4) Smoking:Heavy
11
0
0
0
0
4
2
3
2
0
18
0
0
0
4
3
7
4
0
0
51
0
0
25
10
12
4
0
0
0
88
0
18
24
33
13
0
0
0
0
25
10
6
7
2
4
4
25
18
10
61
0
0
0
2
3
10
24
6
0
45
0
0
3
7
12
33
7
0
0
62
0
2
4
4
13
2
0
0
0
25

The Burt has a clearly defined structure. In the case of two categorical variables (shown above), it consists of 4 partitions: (1) the crosstabulation of variable Employee against itself, (2) the crosstabulation of variable Employee against variable Smoking, (3), the crosstabulation of variable Smoking against variable Employee, and (4) the crosstabulation of variable Smoking against itself. Note that the matrix is symmetrical, and that the sum of the diagonal elements in each partition representing the crosstabulation of a variable against itself must be the same (e.g., there were a total of 193 observations in the present example, and hence, the diagonal elements in the crosstabulation tables of variable Employee against itself, and Smoking against itself must also be equal to 193).

Note that the off-diagonal elements in the partitions representing the crosstabulations of a variable against itself are equal to 0 in the table shown above. However, this is not necessarily always the case, for example, when the Burt was derived from a design or indicator matrix that included fuzzy coding of category membership (see above).

## Burt Tables

The Burt table is the result of the inner product of a design or indicator matrix, and the multiple correspondence analysis results are identical to the results one would obtain for the column points from a simple correspondence analysis of the indicator or design matrix (see also MCA).

For example, suppose you had entered data concerning the Survival for different Age groups in different Locations like this:

SURVIVAL AGE LOCATION
Case No. NO YES LESST50 A50TO69 OVER69 TOKYO BOSTON GLAMORGN
1
2
3
4
...
...
...
762
763
764
0
1
0
0
.
.
.
1
0
0
1
0
1
1
.
.
.
0
1
1
0
1
0
0
.
.
.
0
1
0
1
0
1
0
.
.
.
1
0
1
0
0
0
1
.
.
.
0
0
0
0
1
0
0
.
.
.
1
0
0
0
0
1
0
.
.
.
0
1
0
1
0
0
1
.
.
.
0
0
1

In this data arrangement, for each case a 1 was entered to indicate to which category, of a particular set of categories, a case belongs (e.g., Survival, with the categories No and Yes). For example, case 1 survived (a 0 was entered for variable No, and a 1 was entered for variable Yes), case 1 is between age 50 and 69 (a 1 was entered for variable A50to69), and was observed in Glamorgn). Overall there are 764 observations in the data set.

If you denote the data (design or indicator matrix) shown above as matrix X, then matrix product X'X is a Burt table); shown below is an example of a Burt table that one might obtain in this manner.

SURVIVAL AGE LOCATION
NO YES <50 50-69 69+ TOKYO BOSTON GLAMORGN
SURVIVAL:NO
SURVIVAL:YES

AGE:UNDER_50
AGE:A_50TO69
AGE:OVER_69

LOCATION:TOKYO
LOCATION:BOSTON
LOCATION:GLAMORGN
210
0

68
93
49

60
82
68
0
554

212
258
84

230
171
153
68
212

280
0
0

151
58
71
93
258

0
351
0

120
122
109
49
84

0
0
133

19
73
41
60
230

151
120
19

290
0
0
82
171

58
122
73

0
253
0
68
153

71
109
41

0
0
221

The Burt table has a clearly defined structure. Overall, the data matrix is symmetrical. In the case of 3 categorical variables (as shown above), the data matrix consists 3 x 3 = 9 partitions, created by each variable being tabulated against itself, and against the categories of all other variables. Note that the sum of the diagonal elements in each diagonal partition (i.e., where the respective variables are tabulated against themselves) is constant (equal to 764 in this case).

The off-diagonal elements in each diagonal partition in this example are all 0. If the cases in the design or indicator matrix are assigned to categories via fuzzy coding (i.e., if probabilities are used to indicate likelihood of membership in a category, rather than 0/1 coding to indicate actual membership), then the off-diagonal elements of the diagonal partitions are not necessarily equal to 0.

[Select Rating]

# What is Data Mining (Predictive Analytics, Big Data)

## Data Mining

Data Mining is an analytic process designed to explore data (usually large amounts of data - typically business or market related - also known as "big data") in search of consistent patterns and/or systematic relationships between variables, and then to validate the findings by applying the detected patterns to new subsets of data. The ultimate goal of data mining is prediction - and predictive data mining is the most common type of data mining and one that has the most direct business applications. The process of data mining consists of three stages: (1) the initial exploration, (2) model building or pattern identification with validation/verification, and (3) deployment (i.e., the application of the model to new data in order to generate predictions).

Stage 1: Exploration. This stage usually starts with data preparation which may involve cleaning data, data transformations, selecting subsets of records and - in case of data sets with large numbers of variables ("fields") - performing some preliminary feature selection operations to bring the number of variables to a manageable range (depending on the statistical methods which are being considered). Then, depending on the nature of the analytic problem, this first stage of the process of data mining may involve anywhere between a simple choice of straightforward predictors for a regression model, to elaborate exploratory analyses using a wide variety of graphical and statistical methods (see Exploratory Data Analysis (EDA)) in order to identify the most relevant variables and determine the complexity and/or the general nature of models that can be taken into account in the next stage.

Stage 2: Model building and validation. This stage involves considering various models and choosing the best one based on their predictive performance (i.e., explaining the variability in question and producing stable results across samples). This may sound like a simple operation, but in fact, it sometimes involves a very elaborate process. There are a variety of techniques developed to achieve that goal - many of which are based on so-called "competitive evaluation of models," that is, applying different models to the same data set and then comparing their performance to choose the best. These techniques - which are often considered the core of predictive data mining - include: Bagging (Voting, Averaging), Boosting, Stacking (Stacked Generalizations), and Meta-Learning.

Stage 3: Deployment. That final stage involves using the model selected as best in the previous stage and applying it to new data in order to generate predictions or estimates of the expected outcome.

The concept of Data Mining is becoming increasingly popular as a business information management tool where it is expected to reveal knowledge structures that can guide decisions in conditions of limited certainty. Recently, there has been increased interest in developing new analytic techniques specifically designed to address the issues relevant to business Data Mining (e.g., Classification Trees), but Data Mining is still based on the conceptual principles of statistics including the traditional Exploratory Data Analysis (EDA) and modeling and it shares with them both some components of its general approaches and specific techniques.

However, an important general difference in the focus and purpose between Data Mining and the traditional Exploratory Data Analysis (EDA) is that Data Mining is more oriented towards applications than the basic nature of the underlying phenomena. In other words, Data Mining is relatively less concerned with identifying the specific relations between the involved variables. For example, uncovering the nature of the underlying functions or the specific types of interactive, multivariate dependencies between variables are not the main goal of Data Mining. Instead, the focus is on producing a solution that can generate useful predictions. Therefore, Data Mining accepts among others a "black box" approach to data exploration or knowledge discovery and uses not only the traditional Exploratory Data Analysis (EDA) techniques, but also such techniques as Neural Networks which can generate valid predictions but are not capable of identifying the specific nature of the interrelations between the variables on which the predictions are based.

Data Mining is often considered to be "a blend of statistics, AI (artificial intelligence), and data base research" (Pregibon, 1997, p. 8), which until very recently was not commonly recognized as a field of interest for statisticians, and was even considered by some "a dirty word in Statistics" (Pregibon, 1997, p. 8). Due to its applied importance, however, the field emerges as a rapidly growing and major area (also in statistics) where important theoretical advances are being made (see, for example, the recent annual International Conferences on Knowledge Discovery and Data Mining, co-hosted by the American Statistical Association).

For information on Data Mining techniques, review the summary topics included below. There are numerous books that review the theory and practice of data mining; the following books offer a representative sample of recent general books on data mining, representing a variety of approaches and perspectives:

Berry, M., J., A., & Linoff, G., S., (2000). Mastering data mining. New York: Wiley.

Edelstein, H., A. (1999). Introduction to data mining and knowledge discovery (3rd ed). Potomac, MD: Two Crows Corp.

Fayyad, U. M., Piatetsky-Shapiro, G., Smyth, P., & Uthurusamy, R. (1996). Advances in knowledge discovery & data mining. Cambridge, MA: MIT Press.

Han, J., Kamber, M. (2000). Data mining: Concepts and Techniques. New York: Morgan-Kaufman.

Hastie, T., Tibshirani, R., & Friedman, J. H. (2001). The elements of statistical learning : Data mining, inference, and prediction. New York: Springer.

Pregibon, D. (1997). Data Mining. Statistical Computing and Graphics, 7, 8.

Weiss, S. M., & Indurkhya, N. (1997). Predictive data mining: A practical guide. New York: Morgan-Kaufman.

Westphal, C., Blaxton, T. (1998). Data mining solutions. New York: Wiley.

Witten, I. H., & Frank, E. (2000). Data mining. New York: Morgan-Kaufmann.

## Crucial Concepts in Data Mining

Bagging (Voting, Averaging)
The concept of bagging (voting for classification, averaging for regression-type problems with continuous dependent variables of interest) applies to the area of predictive data mining, to combine the predicted classifications (prediction) from multiple models, or from the same type of model for different learning data. It is also used to address the inherent instability of results when applying complex models to relatively small data sets. Suppose your data mining task is to build a model for predictive classification, and the dataset from which to train the model (learning data set, which contains observed classifications) is relatively small. You could repeatedly sub-sample (with replacement) from the dataset, and apply, for example, a tree classifier (e.g., C&RT and CHAID) to the successive samples. In practice, very different trees will often be grown for the different samples, illustrating the instability of models often evident with small data sets. One method of deriving a single prediction (for new observations) is to use all trees found in the different samples, and to apply some simple voting: The final classification is the one most often predicted by the different trees. Note that some weighted combination of predictions (weighted vote, weighted average) is also possible, and commonly used. A sophisticated (machine learning) algorithm for generating weights for weighted prediction or voting is the Boosting procedure.

Boosting
The concept of boosting applies to the area of predictive data mining, to generate multiple models or classifiers (for prediction or classification), and to derive weights to combine the predictions from those models into a single prediction or predicted classification (see also Bagging).

A simple algorithm for boosting works like this: Start by applying some method (e.g., a tree classifier such as C&RT or CHAID) to the learning data, where each observation is assigned an equal weight. Compute the predicted classifications, and apply weights to the observations in the learning sample that are inversely proportional to the accuracy of the classification. In other words, assign greater weight to those observations that were difficult to classify (where the misclassification rate was high), and lower weights to those that were easy to classify (where the misclassification rate was low). In the context of C&RT for example, different misclassification costs (for the different classes) can be applied, inversely proportional to the accuracy of prediction in each class. Then apply the classifier again to the weighted data (or with different misclassification costs), and continue with the next iteration (application of the analysis method for classification to the re-weighted data).

Boosting will generate a sequence of classifiers, where each consecutive classifier in the sequence is an "expert" in classifying observations that were not well classified by those preceding it. During deployment (for prediction or classification of new cases), the predictions from the different classifiers can then be combined (e.g., via voting, or some weighted voting procedure) to derive a single best prediction or classification.

Note that boosting can also be applied to learning methods that do not explicitly support weights or misclassification costs. In that case, random sub-sampling can be applied to the learning data in the successive steps of the iterative boosting procedure, where the probability for selection of an observation into the subsample is inversely proportional to the accuracy of the prediction for that observation in the previous iteration (in the sequence of iterations of the boosting procedure).

CRISP
See Models for Data Mining.

Data Preparation (in Data Mining)
Data preparation and cleaning is an often neglected but extremely important step in the data mining process. The old saying "garbage-in-garbage-out" is particularly applicable to the typical data mining projects where large data sets collected via some automatic methods (e.g., via the Web) serve as the input into the analyses. Often, the method by which the data where gathered was not tightly controlled, and so the data may contain out-of-range values (e.g., Income: -100), impossible data combinations (e.g., Gender: Male, Pregnant: Yes), and the like. Analyzing data that has not been carefully screened for such problems can produce highly misleading results, in particular in predictive data mining.

Data Reduction (for Data Mining)
The term Data Reduction in the context of data mining is usually applied to projects where the goal is to aggregate or amalgamate the information contained in large datasets into manageable (smaller) information nuggets. Data reduction methods can include simple tabulation, aggregation (computing descriptive statistics) or more sophisticated techniques like clustering, principal components analysis, etc.

Deployment
The concept of deployment in predictive data mining refers to the application of a model for prediction or classification to new data. After a satisfactory model or set of models has been identified (trained) for a particular application, we usually want to deploy those models so that predictions or predicted classifications can quickly be obtained for new data. For example, a credit card company may want to deploy a trained model or set of models (e.g., neural networks, meta-learner) to quickly identify transactions which have a high probability of being fraudulent.

Drill-Down Analysis
The concept of drill-down analysis applies to the area of data mining, to denote the interactive exploration of data, in particular of large databases. The process of drill-down analyses begins by considering some simple break-downs of the data by a few variables of interest (e.g., Gender, geographic region, etc.). Various statistics, tables, histograms, and other graphical summaries can be computed for each group. Next, we may want to "drill-down" to expose and further analyze the data "underneath" one of the categorizations, for example, we might want to further review the data for males from the mid-west. Again, various statistical and graphical summaries can be computed for those cases only, which might suggest further break-downs by other variables (e.g., income, age, etc.). At the lowest ("bottom") level are the raw data: For example, you may want to review the addresses of male customers from one region, for a certain income group, etc., and to offer to those customers some particular services of particular utility to that group.

Feature Selection
One of the preliminary stage in predictive data mining, when the data set includes more variables than could be included (or would be efficient to include) in the actual model building phase (or even in initial exploratory operations), is to select predictors from a large list of candidates. For example, when data are collected via automated (computerized) methods, it is not uncommon that measurements are recorded for thousands or hundreds of thousands (or more) of predictors. The standard analytic methods for predictive data mining, such as neural network analyses, classification and regression trees, generalized linear models, or general linear models become impractical when the number of predictors exceed more than a few hundred variables.

Feature selection selects a subset of predictors from a large list of candidate predictors without assuming that the relationships between the predictors and the dependent or outcome variables of interest are linear, or even monotone. Therefore, this is used as a pre-processor for predictive data mining, to select manageable sets of predictors that are likely related to the dependent (outcome) variables of interest, for further analyses with any of the other methods for regression and classification.

Machine Learning
Machine learning, computational learning theory, and similar terms are often used in the context of Data Mining, to denote the application of generic model-fitting or classification algorithms for predictive data mining. Unlike traditional statistical data analysis, which is usually concerned with the estimation of population parameters by statistical inference, the emphasis in data mining (and machine learning) is usually on the accuracy of prediction (predicted classification), regardless of whether or not the "models" or techniques that are used to generate the prediction is interpretable or open to simple explanation. Good examples of this type of technique often applied to predictive data mining are neural networks or meta-learning techniques such as boosting, etc. These methods usually involve the fitting of very complex "generic" models, that are not related to any reasoning or theoretical understanding of underlying causal processes; instead, these techniques can be shown to generate accurate predictions or classification in crossvalidation samples.

Meta-Learning
The concept of meta-learning applies to the area of predictive data mining, to combine the predictions from multiple models. It is particularly useful when the types of models included in the project are very different. In this context, this procedure is also referred to as Stacking (Stacked Generalization).

Suppose your data mining project includes tree classifiers, such as C&RT and CHAID, linear discriminant analysis (e.g., see GDA), and Neural Networks. Each computes predicted classifications for a crossvalidation sample, from which overall goodness-of-fit statistics (e.g., misclassification rates) can be computed. Experience has shown that combining the predictions from multiple methods often yields more accurate predictions than can be derived from any one method (e.g., see Witten and Frank, 2000). The predictions from different classifiers can be used as input into a meta-learner, which will attempt to combine the predictions to create a final best predicted classification. So, for example, the predicted classifications from the tree classifiers, linear model, and the neural network classifier(s) can be used as input variables into a neural network meta-classifier, which will attempt to "learn" from the data how to combine the predictions from the different models to yield maximum classification accuracy.

We can apply meta-learners to the results from different meta-learners to create "meta-meta"-learners, and so on; however, in practice such exponential increase in the amount of data processing, in order to derive an accurate prediction, will yield less and less marginal utility.

Models for Data Mining
In the business environment, complex data mining projects may require the coordinate efforts of various experts, stakeholders, or departments throughout an entire organization. In the data mining literature, various "general frameworks" have been proposed to serve as blueprints for how to organize the process of gathering data, analyzing data, disseminating results, implementing results, and monitoring improvements.

One such model, CRISP (Cross-Industry Standard Process for data mining) was proposed in the mid-1990s by a European consortium of companies to serve as a non-proprietary standard process model for data mining. This general approach postulates the following (perhaps not particularly controversial) general sequence of steps for data mining projects:

Another approach - the Six Sigma methodology - is a well-structured, data-driven methodology for eliminating defects, waste, or quality control problems of all kinds in manufacturing, service delivery, management, and other business activities. This model has recently become very popular (due to its successful implementations) in various American industries, and it appears to gain favor worldwide. It postulated a sequence of, so-called, DMAIC steps -

- that grew up from the manufacturing, quality improvement, and process control traditions and is particularly well suited to production environments (including "production of services," i.e., service industries).

Another framework of this kind (actually somewhat similar to Six Sigma) is the approach proposed by SAS Institute called SEMMA -

- which is focusing more on the technical activities typically involved in a data mining project.

All of these models are concerned with the process of how to integrate data mining methodology into an organization, how to "convert data into information," how to involve important stake-holders, and how to disseminate the information in a form that can easily be converted by stake-holders into resources for strategic decision making.

Some software tools for data mining are specifically designed and documented to fit into one of these specific frameworks.

The general underlying philosophy of StatSoft's STATISTICA Data Miner is to provide a flexible data mining workbench that can be integrated into any organization, industry, or organizational culture, regardless of the general data mining process-model that the organization chooses to adopt. For example, STATISTICA Data Miner can include the complete set of (specific) necessary tools for ongoing company wide Six Sigma quality control efforts, and users can take advantage of its (still optional) DMAIC-centric user interface for industrial data mining tools. It can equally well be integrated into ongoing marketing research, CRM (Customer Relationship Management) projects, etc. that follow either the CRISP or SEMMA approach - it fits both of them perfectly well without favoring either one. Also, STATISTICA Data Miner offers all the advantages of a general data mining oriented "development kit" that includes easy to use tools for incorporating into your projects not only such components as custom database gateway solutions, prompted interactive queries, or proprietary algorithms, but also systems of access privileges, workgroup management, and other collaborative work tools that allow you to design large scale, enterprise-wide systems (e.g., following the CRISP, SEMMA, or a combination of both models) that involve your entire organization.

Predictive Data Mining
The term Predictive Data Mining is usually applied to identify data mining projects with the goal to identify a statistical or neural network model or set of models that can be used to predict some response of interest. For example, a credit card company may want to engage in predictive data mining, to derive a (trained) model or set of models (e.g., neural networks, meta-learner) that can quickly identify transactions which have a high probability of being fraudulent. Other types of data mining projects may be more exploratory in nature (e.g., to identify cluster or segments of customers), in which case drill-down descriptive and exploratory methods would be applied. Data reduction is another possible objective for data mining (e.g., to aggregate or amalgamate the information in very large data sets into useful and manageable chunks).

SEMMA
See Models for Data Mining.

Stacked Generalization
See Stacking.

Stacking (Stacked Generalization)
The concept of stacking (Stacked Generalization) applies to the area of predictive data mining, to combine the predictions from multiple models. It is particularly useful when the types of models included in the project are very different.

Suppose your data mining project includes tree classifiers, such as C&RT or CHAID, linear discriminant analysis (e.g., see GDA), and Neural Networks. Each computes predicted classifications for a crossvalidation sample, from which overall goodness-of-fit statistics (e.g., misclassification rates) can be computed. Experience has shown that combining the predictions from multiple methods often yields more accurate predictions than can be derived from any one method (e.g., see Witten and Frank, 2000). In stacking, the predictions from different classifiers are used as input into a meta-learner, which attempts to combine the predictions to create a final best predicted classification. So, for example, the predicted classifications from the tree classifiers, linear model, and the neural network classifier(s) can be used as input variables into a neural network meta-classifier, which will attempt to "learn" from the data how to combine the predictions from the different models to yield maximum classification accuracy.

Other methods for combining the prediction from multiple models or methods (e.g., from multiple datasets used for learning) are Boosting and Bagging (Voting).

Text Mining
While Data Mining is typically concerned with the detection of patterns in numeric data, very often important (e.g., critical to business) information is stored in the form of text. Unlike numeric data, text is often amorphous, and difficult to deal with. Text mining generally consists of the analysis of (multiple) text documents by extracting key phrases, concepts, etc. and the preparation of the text processed in that manner for further analyses with numeric data mining techniques (e.g., to determine co-occurrences of concepts, key phrases, names, addresses, product names, etc.).

Voting
See Bagging.

## Data Warehousing

StatSoft defines data warehousing as a process of organizing the storage of large, multivariate data sets in a way that facilitates the retrieval of information for analytic purposes.

The most efficient data warehousing architecture will be capable of incorporating or at least referencing all data available in the relevant enterprise-wide information management systems, using designated technology suitable for corporate data base management (e.g., Oracle, Sybase, MS SQL Server. Also, a flexible, high-performance (see the IDP technology), open architecture approach to data warehousing - that flexibly integrates with the existing corporate systems and allows the users to organize and efficiently reference for analytic purposes enterprise repositories of data of practically any complexity - is offered in StatSoft enterprise systems such as STATISTICA Enterprise and STATISTICA Enterprise/QC , which can also work in conjunction with STATISTICA Data Miner and STATISTICA Enterprise Server.

## On-Line Analytic Processing (OLAP)

The term On-Line Analytic Processing - OLAP (or Fast Analysis of Shared Multidimensional Information - FASMI) refers to technology that allows users of multidimensional databases to generate on-line descriptive or comparative summaries ("views") of data and other analytic queries. Note that despite its name, analyses referred to as OLAP do not need to be performed truly "on-line" (or in real-time); the term applies to analyses of multidimensional databases (that may, obviously, contain dynamically updated information) through efficient "multidimensional" queries that reference various types of data. OLAP facilities can be integrated into corporate (enterprise-wide) database systems and they allow analysts and managers to monitor the performance of the business (e.g., such as various aspects of the manufacturing process or numbers and types of completed transactions at different locations) or the market. The final result of OLAP techniques can be very simple (e.g., frequency tables, descriptive statistics, simple cross-tabulations) or more complex (e.g., they may involve seasonal adjustments, removal of outliers, and other forms of cleaning the data). Although Data Mining techniques can operate on any kind of unprocessed or even unstructured information, they can also be applied to the data views and summaries generated by OLAP to provide more in-depth and often more multidimensional knowledge. In this sense, Data Mining techniques could be considered to represent either a different analytic approach (serving different purposes than OLAP) or as an analytic extension of OLAP.

## Exploratory Data Analysis (EDA)

### EDA vs. Hypothesis Testing

As opposed to traditional hypothesis testing designed to verify a priori hypotheses about relations between variables (e.g., "There is a positive correlation between the AGE of a person and his/her RISK TAKING disposition"), exploratory data analysis (EDA) is used to identify systematic relations between variables when there are no (or not complete) a priori expectations as to the nature of those relations. In a typical exploratory data analysis process, many variables are taken into account and compared, using a variety of techniques in the search for systematic patterns.

### Computational EDA techniques

Computational exploratory data analysis methods include both simple basic statistics and more advanced, designated multivariate exploratory techniques designed to identify patterns in multivariate data sets.

Basic statistical exploratory methods. The basic statistical exploratory methods include such techniques as examining distributions of variables (e.g., to identify highly skewed or non-normal, such as bi-modal patterns), reviewing large correlation matrices for coefficients that meet certain thresholds (see example above), or examining multi-way frequency tables (e.g., "slice by slice" systematically reviewing combinations of levels of control variables).

Multivariate exploratory techniques. Multivariate exploratory techniques designed specifically to identify patterns in multivariate (or univariate, such as sequences of measurements) data sets include: Cluster Analysis, Factor Analysis, Discriminant Function Analysis, Multidimensional Scaling, Log-linear Analysis, Canonical Correlation, Stepwise Linear and Nonlinear (e.g., Logit) Regression, Correspondence Analysis, Time Series Analysis, and Classification Trees.

Neural Networks. Neural Networks are analytic techniques modeled after the (hypothesized) processes of learning in the cognitive system and the neurological functions of the brain and capable of predicting new observations (on specific variables) from other observations (on the same or other variables) after executing a process of so-called learning from existing data.

### Graphical (Data Visualization) EDA Techniques

A large selection of powerful exploratory data analytic techniques is also offered by graphical data visualization methods that can identify relations, trends, and biases "hidden" in unstructured data sets.

Brushing. Perhaps the most common and historically first widely used technique explicitly identified as graphical exploratory data analysis is brushing, an interactive method allowing us to select on-screen specific data points or subsets of data and identify their (e.g., common) characteristics, or to examine their effects on relations between relevant variables. Those relations between variables can be visualized by fitted functions (e.g., 2D lines or 3D surfaces) and their confidence intervals, thus, for example, we can examine changes in those functions by interactively (temporarily) removing or adding specific subsets of data. For example, one of many applications of the brushing technique is to select (i.e., highlight) in a matrix scatterplot all data points that belong to a certain category (e.g., a "medium" income level, see the highlighted subset in the fourth component graph of the first row in the illustration left) in order to examine how those specific observations contribute to relations between other variables in the same data set (e.g, the correlation between the "debt" and "assets" in the current example). If the brushing facility supports features such as "animated brushing" or "automatic function re-fitting," we can define a dynamic brush that would move over the consecutive ranges of a criterion variable (e.g., "income" measured on a continuous scale or a discrete [3-level] scale as on the illustration above) and examine the dynamics of the contribution of the criterion variable to the relations between other relevant variables in the same data set.

Other graphical EDA techniques. Other graphical exploratory analytic techniques include function fitting and plotting, data smoothing, overlaying and merging of multiple displays, categorizing data, splitting/merging subsets of data in graphs, aggregating data in graphs, identifying and marking subsets of data that meet specific conditions, icon plots,

shading, plotting confidence intervals and confidence areas (e.g., ellipses),

generating tessellations, spectral planes,

integrated layered compressions,

with animated stratification (cross-sections) of 3D displays, and selective highlighting of specific series and blocks of data.

### Verification of Results of EDA

The exploration of data can only serve as the first stage of data analysis and its results can be treated as tentative at best as long as they are not confirmed, e.g., crossvalidated, using a different data set (or an independent subset). If the result of the exploratory stage suggests a particular model, then its validity can be verified by applying it to a new data set and testing its fit (e.g., testing its predictive validity). Case selection conditions can be used to quickly define subsets of data (e.g., for estimation and verification), and for testing the robustness of results.

## Neural Networks

Neural Networks are analytic techniques modeled after the (hypothesized) processes of learning in the cognitive system and the neurological functions of the brain and capable of predicting new observations (on specific variables) from other observations (on the same or other variables) after executing a process of so-called learning from existing data. Neural Networks is one of the Data Mining techniques.

The first step is to design a specific network architecture (that includes a specific number of "layers" each consisting of a certain number of "neurons"). The size and structure of the network needs to match the nature (e.g., the formal complexity) of the investigated phenomenon. Because the latter is obviously not known very well at this early stage, this task is not easy and often involves multiple "trials and errors." (Now, there is, however, neural network software that applies artificial intelligence techniques to aid in that tedious task and finds "the best" network architecture.)

The new network is then subjected to the process of "training." In that phase, neurons apply an iterative process to the number of inputs (variables) to adjust the weights of the network in order to optimally predict (in traditional terms, we could say find a "fit" to) the sample data on which the "training" is performed. After the phase of learning from an existing data set, the new network is ready and it can then be used to generate predictions.

The resulting "network" developed in the process of "learning" represents a pattern detected in the data. Thus, in this approach, the "network" is the functional equivalent of a model of relations between variables in the traditional model building approach. However, unlike in the traditional models, in the "network," those relations cannot be articulated in the usual terms used in statistics or methodology to describe relations between variables (such as, for example, "A is positively correlated with B but only for observations where the value of C is low and D is high"). Some neural networks can produce highly accurate predictions; they represent, however, a typical a-theoretical (one can say, "a black box") research approach. That approach is concerned only with practical considerations, that is, with the predictive validity of the solution and its applied relevance and not with the nature of the underlying mechanism or its relevance for any "theory" of the underlying phenomena.

However, it should be mentioned that Neural Network techniques can also be used as a component of analyses designed to build explanatory models because Neural Networks can help explore data sets in search for relevant variables or groups of variables; the results of such explorations can then facilitate the process of model building. Moreover, now there is neural network software that uses sophisticated algorithms to search for the most relevant input variables, thus potentially contributing directly to the model building process.

One of the major advantages of neural networks is that, theoretically, they are capable of approximating any continuous function, and thus the researcher does not need to have any hypotheses about the underlying model, or even to some extent, which variables matter. An important disadvantage, however, is that the final solution depends on the initial conditions of the network, and, as stated before, it is virtually impossible to "interpret" the solution in traditional, analytic terms, such as those used to build theories that explain phenomena.

Some authors stress the fact that neural networks use, or we should say are expected to use, massively parallel computation models. For example Haykin (1994) defines neural network as:

"a massively parallel distributed processor that has a natural propensity for storing experiential knowledge and making it available for use. It resembles the brain in two respects: (1) Knowledge is acquired by the network through a learning process, and (2) Interneuron connection strengths known as synaptic weights are used to store the knowledge." (p. 2).

However, as Ripley (1996) points out, the vast majority of contemporary neural network applications run on single-processor computers and he argues that a large speed-up can be achieved not only by developing software that will take advantage of multiprocessor hardware by also by designing better (more efficient) learning algorithms.

Neural networks is one of the methods used in Data Mining; see also Exploratory Data Analysis. For more information on neural networks, see Haykin (1994), Masters (1995), Ripley (1996), and Welstead (1994). For a discussion of neural networks as statistical tools, see Warner and Misra (1996). See also, STATISTICA Neural Networks.

Related link: StatSoft provides Data Mining and Predictive Analytics software and services. If you work for a business or government, you can request a STATISTICA Data Miner software trial. StatSoft Sales will contact you and discuss the options.