In this introduction, we will briefly discuss those elementary statistical concepts that provide the necessary foundations for more specialized expertise in any area of statistical data analysis. The selected topics illustrate the basic assumptions of most statistical methods and/or have been demonstrated in research to be necessary components of our general understanding of the "quantitative nature" of reality (Nisbett, et al., 1987). We will focus mostly on the functional aspects of the concepts discussed and the presentation will be very short. Further information on each of the concepts can be found in statistical textbooks. Recommended introductory textbooks are: Kachigan (1986), and Runyon and Haber (1976); for a more advanced discussion of elementary theory and assumptions of statistics, see the classic books by Hays (1988), and Kendall and Stuart (1979).
Variables are things that we measure, control, or manipulate in research. They differ in many respects, most notably in the role they are given in our research and in the type of measures that can be applied to them.
Most empirical research belongs clearly to one of these two general categories. In correlational research, we do not (or at least try not to) influence any variables but only measure them and look for relations (correlations) between some set of variables, such as blood pressure and cholesterol level. In experimental research, we manipulate some variables and then measure the effects of this manipulation on other variables. For example, a researcher might artificially increase blood pressure and then record cholesterol level. Data analysis in experimental research also comes down to calculating "correlations" between variables, specifically, those manipulated and those affected by the manipulation. However, experimental data may potentially provide qualitatively better information: only experimental data can conclusively demonstrate causal relations between variables. For example, if we found that whenever we change variable A then variable B changes, then we can conclude that "A influences B." Data from correlational research can only be "interpreted" in causal terms based on some theories that we have, but correlational data cannot conclusively prove causality.
Independent variables are those that are manipulated whereas dependent variables are only measured or registered. This distinction appears terminologically confusing to many because, as some students say, "all variables depend on something." However, once you get used to this distinction, it becomes indispensable. The terms dependent and independent variable apply mostly to experimental research where some variables are manipulated, and in this sense they are "independent" from the initial reaction patterns, features, intentions, etc. of the subjects. Some other variables are expected to be "dependent" on the manipulation or experimental conditions. That is to say, they depend on "what the subject will do" in response. Somewhat contrary to the nature of this distinction, these terms are also used in studies where we do not literally manipulate independent variables, but only assign subjects to "experimental groups" based on some pre-existing properties of the subjects. For example, if in an experiment, males are compared to females regarding their white cell count (WCC), Gender could be called the independent variable and WCC the dependent variable.
Variables differ in how well they can be measured, i.e., in how much measurable information their measurement scale can provide. There is obviously some measurement error involved in every measurement, which determines the amount of information that we can obtain. Another factor that determines the amount of information that can be provided by a variable is its type of measurement scale. Specifically, variables are classified as (a) nominal, (b) ordinal, (c) interval, or (d) ratio.
Regardless of their type, two or more variables are related if, in a sample of observations, the values of those variables are distributed in a consistent manner. In other words, variables are related if their values systematically correspond to each other for these observations. For example, Gender and WCC would be considered to be related if most males had high WCC and most females low WCC, or vice versa; Height is related to Weight because, typically, tall individuals are heavier than short ones; IQ is related to the Number of Errors in a test if people with higher IQs make fewer errors.
Generally speaking, the ultimate goal of every research or scientific analysis is to find relations between variables. The philosophy of science teaches us that there is no other way of representing "meaning" except in terms of relations between some quantities or qualities; either way involves relations between variables. Thus, the advancement of science must always involve finding new relations between variables. Correlational research involves measuring such relations in the most straightforward manner. However, experimental research is not any different in this respect. For example, the above mentioned experiment comparing WCC in males and females can be described as looking for a correlation between two variables: Gender and WCC. Statistics does nothing else but help us evaluate relations between variables. Actually, all of the hundreds of procedures that are described in this online textbook can be interpreted in terms of evaluating various kinds of inter-variable relations.
The two most elementary formal properties of every relation between variables are the relation's (a) magnitude (or "size") and (b) its reliability (or "truthfulness").
The statistical significance of a result is the probability that the observed relationship (e.g., between variables) or a difference (e.g., between means) in a sample occurred by pure chance ("luck of the draw"), and that in the population from which the sample was drawn, no such relationship or differences exist. Using less technical terms, we could say that the statistical significance of a result tells us something about the degree to which the result is "true" (in the sense of being "representative of the population").
More technically, the p-value represents a decreasing index of the reliability of a result (see Brownlee, 1960). The higher the p-value, the less we can believe that the observed relation between variables in the sample is a reliable indicator of the relation between the respective variables in the population. Specifically, the p-value represents the probability of error that is involved in accepting our observed result as valid, that is, as "representative of the population." For example, a p-value of .05 (i.e., 1/20) indicates that there is a 5% probability that the relation between the variables found in our sample is a "fluke." In other words, assuming that in the population there was no relation between those variables whatsoever, and we were repeating experiments such as ours one after another, we could expect that in approximately one of every 20 replications of the experiment the relation between the variables in question would be equal to or stronger than in ours. (Note that this is not the same as saying that, given that there IS a relationship between the variables, we can expect to replicate the results 5% of the time or 95% of the time; when there is a relationship between the variables in the population, the probability of replicating the study and finding that relationship is related to the statistical power of the design. See also Power Analysis.) In many areas of research, a p-value of .05 is customarily treated as a "borderline acceptable" error level.
There is no way to avoid arbitrariness in the final decision as to what level of significance will be treated as really "significant." That is, the selection of a significance level above which results will be rejected as invalid is arbitrary. In practice, the final decision usually depends on whether the outcome was predicted a priori or only found post hoc in the course of many analyses and comparisons performed on the data set, on the total amount of consistent supportive evidence in the entire data set, and on "traditions" existing in the particular area of research. Typically, in many sciences, results that yield p ≤ .05 are considered borderline statistically significant, but remember that this level of significance still involves a pretty high probability of error (5%). Results that are significant at the p ≤ .01 level are commonly considered statistically significant, and p ≤ .005 or p ≤ .001 levels are often called "highly" significant. But remember that these classifications represent nothing else but arbitrary conventions that are only informally based on general research experience.
Needless to say, the more analyses you perform on a data set, the more results will meet "by chance" the conventional significance level. For example, if you calculate correlations between ten variables (i.e., 45 different correlation coefficients), then you should expect to find by chance that about two (i.e., one in every 20) correlation coefficients are significant at the p ≤ .05 level, even if the values of the variables were totally random and those variables do not correlate in the population. Some statistical methods that involve many comparisons and, thus, a good chance for such errors include some "correction" or adjustment for the total number of comparisons. However, many statistical methods (especially simple exploratory data analyses) do not offer any straightforward remedies to this problem. Therefore, it is up to the researcher to carefully evaluate the reliability of unexpected findings. Many examples in this online textbook offer specific advice on how to do this; relevant information can also be found in most research methods textbooks.
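The arithmetic behind this expectation can be sketched directly. The variable names below are placeholders, and the Bonferroni-style adjustment shown at the end is one common remedy for the multiple-comparisons problem, not a feature taken from the source:

```python
from itertools import combinations

# Ten variables yield C(10, 2) pairwise correlations.
variables = [f"var{i}" for i in range(1, 11)]
pairs = list(combinations(variables, 2))
print(len(pairs))  # 45 distinct correlation coefficients

# At the p <= .05 criterion, the expected number of coefficients
# reaching "significance" by chance alone is about one in twenty:
expected_false_positives = round(len(pairs) * 0.05, 2)
print(expected_false_positives)  # 2.25, i.e., roughly two

# A Bonferroni-style adjustment divides the criterion by the
# number of comparisons to control this inflation:
adjusted_alpha = round(0.05 / len(pairs), 5)
print(adjusted_alpha)  # 0.00111
```

The adjustment is conservative; the point here is only that the expected number of chance "findings" scales with the number of comparisons performed.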
We said before that strength and reliability are two different features of relationships between variables. However, they are not totally independent. In general, in a sample of a particular size, the larger the magnitude of the relation between variables, the more reliable the relation (see the next paragraph).
Assuming that there is no relation between the respective variables in the population, the most likely outcome would be also finding no relation between these variables in the research sample. Thus, the stronger the relation found in the sample, the less likely it is that there is no corresponding relation in the population. As you see, the magnitude and significance of a relation appear to be closely related, and we could calculate the significance from the magnitude and vice-versa; however, this is true only if the sample size is kept constant, because the relation of a given strength could be either highly significant or not significant at all, depending on the sample size (see the next paragraph).
If there are very few observations, then there are also respectively few possible combinations of the values of the variables and, thus, the probability of obtaining by chance a combination of those values indicative of a strong relation is relatively high.
Consider the following illustration. If we are interested in two variables (Gender: male/female and WCC: high/low), and there are only four subjects in our sample (two males and two females), then the probability that we will find, purely by chance, a 100% relation between the two variables can be as high as one-eighth. Specifically, there is a one-in-eight chance that both males will have a high WCC and both females a low WCC, or vice versa.
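The one-in-eight figure can be verified by brute-force enumeration. The sketch below assumes each subject is independently and equally likely to show high or low WCC:

```python
from itertools import product

# Two males and two females; enumerate all 2**4 equally likely
# high/low WCC outcomes. Tuple order: (male1, male2, female1, female2).
outcomes = list(product(["high", "low"], repeat=4))

def perfect_relation(m1, m2, f1, f2):
    # A 100% relation: both males on one level, both females on the other.
    return m1 == m2 and f1 == f2 and m1 != f1

matches = [o for o in outcomes if perfect_relation(*o)]
print(len(matches), "of", len(outcomes))  # 2 of 16, i.e., 1/8
```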
Now consider the probability of obtaining such a perfect match by chance if our sample consisted of 100 subjects; the probability of obtaining such an outcome by chance would be practically zero.
Let's look at a more general example. Imagine a theoretical population in which the average value of WCC in males and females is exactly the same. Needless to say, if we start replicating a simple experiment by drawing pairs of samples (of males and females) of a particular size from this population and calculating the difference between the average WCC in each pair of samples, most of the experiments will yield results close to 0. However, from time to time, a pair of samples will be drawn where the difference between males and females will be quite different from 0. How often will it happen? The smaller the sample size in each experiment, the more likely it is that we will obtain such erroneous results, which in this case would be results indicative of the existence of a relation between Gender and WCC obtained from a population in which such a relation does not exist.
Consider this example from research on statistical reasoning (Nisbett, et al., 1987). There are two hospitals: in the first one, 120 babies are born every day; in the other, only 12. On average, the ratio of baby boys to baby girls born every day in each hospital is 50/50. However, one day, in one of those hospitals, twice as many baby girls were born as baby boys. In which hospital was it more likely to happen? The answer is obvious for a statistician, but as research shows, not so obvious for a lay person: it is much more likely to happen in the small hospital. The reason for this is that technically speaking, the probability of a random deviation of a particular size (from the population mean), decreases with the increase in the sample size.
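The hospital comparison can be checked with a simple binomial calculation, assuming independent births with a 50/50 sex ratio and reading "twice as many girls" as at least two girls for every boy:

```python
from math import comb

def tail_prob(n, k_min):
    """P(X >= k_min) for X ~ Binomial(n, 0.5)."""
    return sum(comb(n, k) for k in range(k_min, n + 1)) / 2**n

# "At least twice as many girls as boys" means girls >= 2/3 of births:
p_small = tail_prob(12, 8)    # 12 births: 8 or more girls
p_large = tail_prob(120, 80)  # 120 births: 80 or more girls

print(round(p_small, 4))          # about 0.19
print(p_large < 0.001)            # True: a rare event in the large hospital
print(p_small > p_large)          # True: far more likely in the small hospital
```

The deviation of a given relative size from the 50/50 mean becomes rapidly less probable as the number of births grows, which is the point of the example.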
The examples in the previous paragraphs indicate that if a relationship between the variables in question is "objectively" (i.e., in the population) small, then there is no way to identify such a relation in a study unless the research sample is correspondingly large. Even if our sample is in fact "perfectly representative," the effect will not be statistically significant if the sample is small. Analogously, if a relation in question is "objectively" very large, then it can be found to be highly significant even in a study based on a very small sample.
Consider this additional illustration. If a coin is slightly asymmetrical and, when tossed, is somewhat more likely to produce heads than tails (e.g., 60% vs. 40%), then ten tosses would not be sufficient to convince anyone that the coin is asymmetrical even if the outcome obtained (six heads and four tails) was perfectly representative of the bias of the coin. However, is it so that 10 tosses is not enough to prove anything? No; if the effect in question were large enough, then ten tosses could be quite enough. For instance, imagine now that the coin is so asymmetrical that no matter how you toss it, the outcome will be heads. If you tossed such a coin ten times and each toss produced heads, most people would consider it sufficient evidence that something is wrong with the coin. In other words, it would be considered convincing evidence that in the theoretical population of an infinite number of tosses of this coin, there would be more heads than tails. Thus, if a relation is large, then it can be found to be significant even in a small sample.
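Both coin scenarios can be quantified with the binomial formula; the probabilities below take a fair coin as the null hypothesis:

```python
from math import comb

def prob_k_heads(n, k, p):
    """Binomial probability of exactly k heads in n tosses."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Six heads in ten tosses of a fair coin is entirely unremarkable:
print(round(prob_k_heads(10, 6, 0.5), 4))   # 0.2051

# Ten heads in ten tosses of a fair coin is extremely unlikely:
print(round(prob_k_heads(10, 10, 0.5), 6))  # 0.000977
```

So the same ten tosses are uninformative about a 60/40 bias but nearly decisive about a coin that always lands heads.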
The smaller the relation between variables, the larger the sample size that is necessary to prove it significant. For example, imagine how many tosses would be necessary to prove that a coin is asymmetrical if its bias were only .000001%! Thus, the necessary minimum sample size increases as the magnitude of the effect to be demonstrated decreases. When the magnitude of the effect approaches 0, the necessary sample size to conclusively prove it approaches infinity. That is to say, if there is almost no relation between two variables, then the sample size must be almost equal to the population size, which is assumed to be infinitely large. Statistical significance represents the probability that a similar outcome would be obtained if we tested the entire population. Thus, everything that would be found after testing the entire population would be, by definition, significant at the highest possible level, and this also includes all "no relation" results.
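The inverse growth of the required sample size can be sketched with a standard approximate power formula for a proportion near 0.5; the normal quantiles and the 80% power target below are conventional choices, not taken from the source:

```python
from math import ceil

def tosses_needed(epsilon, z_alpha=1.96, z_beta=0.84):
    """Approximate tosses needed to detect true P(heads) = 0.5 + epsilon
    at the two-sided p <= .05 level with roughly 80% power.
    Uses sqrt(p*(1-p)) ~= 0.5 near a fair coin."""
    return ceil(((z_alpha + z_beta) * 0.5 / epsilon) ** 2)

for eps in (0.1, 0.01, 0.001):
    print(eps, tosses_needed(eps))
# The required n grows roughly as 1/epsilon**2: each tenfold
# shrinkage of the bias multiplies the needed tosses by about 100.
```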
Statisticians have developed very many measures of the magnitude of relationships between variables; the choice of a specific measure in given circumstances depends on the number of variables involved, the measurement scales used, the nature of the relations, etc. Almost all of these measures, however, follow one general principle: they attempt to evaluate the observed relation by comparing it to the "maximum imaginable relation" between those specific variables.
Technically speaking, a common way to perform such evaluations is to look at how differentiated the values of the variables are, and then calculate what part of this "overall available differentiation" is accounted for by instances when that differentiation is "common" to the two (or more) variables in question. Speaking less technically, we compare "what is common in those variables" to "what potentially could have been common if the variables were perfectly related."
Let's consider a simple illustration. Let's say that in our sample, the average index of WCC is 100 in males and 102 in females. Thus, we could say that on average, the deviation of each individual score from the grand mean (101) contains a component due to the gender of the subject; the size of this component is 1. That value, in a sense, represents some measure of relation between Gender and WCC. However, this value is a very poor measure because it does not tell us how relatively large this component is given the "overall differentiation" of WCC scores. Consider two extreme possibilities:
Because the ultimate goal of most statistical tests is to evaluate relations between variables, most statistical tests follow the general format that was explained in the previous paragraph. Technically speaking, they represent a ratio of some measure of the differentiation common in the variables in question to the overall differentiation of those variables. For example, they represent a ratio of the part of the overall differentiation of the WCC scores that can be accounted for by gender to the overall differentiation of the WCC scores. This ratio is usually called a ratio of explained variation to total variation. In statistics, the term explained variation does not necessarily imply that we "conceptually understand" it. It is used only to denote the common variation in the variables in question, that is, the part of variation in one variable that is "explained" by the specific values of the other variable, and vice versa.
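A minimal sketch of this explained-to-total ratio, using made-up WCC scores (the group means of 100 and 102 match the earlier illustration; the individual values are hypothetical):

```python
# Hypothetical WCC scores; the exact numbers are invented for illustration.
males = [99, 100, 100, 101]
females = [101, 102, 102, 103]
scores = males + females

grand_mean = sum(scores) / len(scores)  # 101.0

def ss(values, center):
    """Sum of squared deviations from a given center."""
    return sum((v - center) ** 2 for v in values)

total_variation = ss(scores, grand_mean)

# "Explained" variation: replace each score by its group mean and
# measure how far the group means sit from the grand mean.
mean_m = sum(males) / len(males)      # 100.0
mean_f = sum(females) / len(females)  # 102.0
explained = ss([mean_m] * len(males) + [mean_f] * len(females), grand_mean)

print(round(explained / total_variation, 3))  # 0.667 with these made-up numbers
```

With these numbers, two-thirds of the overall differentiation of WCC is accounted for by Gender; had the individual scores been far more spread out around their group means, the same 2-point group difference would explain only a tiny share.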
Let's assume that we have already calculated a measure of a relation between two variables (as explained above). The next question is "how significant is this relation?" For example, is 40% of the explained variance between the two variables enough to consider the relation significant? The answer is "it depends."
Specifically, the significance depends mostly on the sample size. As explained before, in very large samples, even very small relations between variables will be significant, whereas in very small samples even very large relations cannot be considered reliable (significant). Thus, in order to determine the level of statistical significance, we need a function that represents the relationship between the "magnitude" and the "significance" of relations between two variables, depending on the sample size. The function we need would tell us exactly "how likely it is to obtain a relation of a given magnitude (or larger) from a sample of a given size, assuming that there is no such relation between those variables in the population." In other words, that function would give us the significance (p) level, and it would tell us the probability of error involved in rejecting the idea that the relation in question does not exist in the population. This hypothesis (that there is no relation in the population) is usually called the null hypothesis. It would be ideal if the probability function were linear and, for example, only had different slopes for different sample sizes. Unfortunately, the function is more complex and is not always exactly the same; however, in most cases we know its shape and can use it to determine the significance levels for our findings in samples of a particular size. Most of these functions are related to a general type of function, which is called normal.
The "normal distribution" is important because in most cases, it well approximates the function that was introduced in the previous paragraph (for a detailed illustration, see Are All Test Statistics Normally Distributed?). The distribution of many test statistics is normal or follows some form that can be derived from the normal distribution. In this sense, philosophically speaking, the normal distribution represents one of the empirically verified elementary "truths about the general nature of reality," and its status can be compared to the one of fundamental laws of natural sciences. The exact shape of the normal distribution (the characteristic "bell curve") is defined by a function that has only two parameters: mean and standard deviation.
A characteristic property of the normal distribution is that 68% of all of its observations fall within a range of ±1 standard deviation from the mean, and a range of ±2 standard deviations includes 95% of the scores. In other words, in a normal distribution, observations that have a standardized value of less than -2 or more than +2 have a relative frequency of 5% or less. (Standardized value means that a value is expressed in terms of its difference from the mean, divided by the standard deviation.) If you have access to STATISTICA, you can explore the exact values of probability associated with different values in the normal distribution using the interactive Probability Calculator tool; for example, if you enter the Z value (i.e., standardized value) of 4, the associated probability computed by STATISTICA will be less than .0001, because in the normal distribution almost all observations (i.e., more than 99.99%) fall within the range of ±4 standard deviations. The animation below shows the tail area associated with other Z values.
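These coverage figures can be reproduced from the normal distribution's cumulative function; the sketch below uses the standard error-function identity from the Python standard library rather than any STATISTICA tool:

```python
from math import erf, sqrt

def within(z):
    """Probability that a standard normal observation falls within ±z
    standard deviations of the mean: erf(z / sqrt(2))."""
    return erf(z / sqrt(2))

print(round(within(1), 4))  # 0.6827 (the "68%" range)
print(round(within(2), 4))  # 0.9545 (roughly 95%)
print(within(4) > 0.9999)   # True: almost all observations lie within ±4 SD
```

Equivalently, the tail probability outside ±Z is `1 - within(Z)`, which for Z = 4 is below .0001, matching the Probability Calculator example above.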
Recall the example discussed above, where pairs of samples of males and females were drawn from a population in which the average value of WCC in males and females was exactly the same. Although the most likely outcome of such experiments (one pair of samples per experiment) was that the difference between the average WCC in males and females in each pair is close to zero, from time to time, a pair of samples will be drawn where the difference between males and females is quite different from 0. How often does it happen? If the sample size is large enough, the results of such replications are "normally distributed" (this important principle is explained and illustrated in the next paragraph) and, thus, knowing the shape of the normal curve, we can precisely calculate the probability of obtaining "by chance" outcomes representing various levels of deviation from the hypothetical population mean of 0. If such a calculated probability is so low that it meets the previously accepted criterion of statistical significance, then we have only one choice: conclude that our result gives a better approximation of what is going on in the population than the "null hypothesis" (remember that the null hypothesis was considered only for "technical reasons" as a benchmark against which our empirical result was evaluated). Note that this entire reasoning is based on the assumption that the shape of the distribution of those "replications" (technically, the "sampling distribution") is normal. This assumption is discussed in the next paragraph.
Not all, but most statistical tests are either based on the normal distribution directly or on distributions that are related to it and can be derived from it, such as the t, F, or Chi-square distributions. Typically, these tests require that the variables analyzed are themselves normally distributed in the population, that is, that they meet the so-called "normality assumption." Many observed variables actually are normally distributed, which is another reason why the normal distribution represents a "general feature" of empirical reality. A problem may occur when we try to use a normal distribution-based test to analyze data from variables that are themselves not normally distributed (see tests of normality in Nonparametrics or ANOVA/MANOVA). In such cases, we have two general choices. First, we can use some alternative "nonparametric" test (or so-called "distribution-free" test; see Nonparametrics); but this is often inconvenient because such tests are typically less powerful and less flexible in terms of the types of conclusions that they can provide. Alternatively, in many cases we can still use the normal distribution-based test if we only make sure that the size of our samples is large enough. The latter option is based on an extremely important principle that is largely responsible for the popularity of tests that are based on the normal function. Namely, as the sample size increases, the shape of the sampling distribution (i.e., the distribution of a statistic from the sample; this term was first used by Fisher, 1928a) approaches normal shape, even if the distribution of the variable in question is not normal. This principle is illustrated in the following animation, showing a series of sampling distributions (created with gradually increasing sample sizes of 2, 5, 10, 15, and 30) using a variable that is clearly non-normal in the population, that is, the distribution of its values is clearly skewed.
However, as the sample size (of samples used to create the sampling distribution of the mean) increases, the shape of the sampling distribution becomes normal. Note that for n=30, the shape of that distribution is "almost" perfectly normal (see the close match of the fit). This principle is called the central limit theorem (this term was first used by Pólya, 1920; German, "Zentraler Grenzwertsatz").
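The central limit theorem can be demonstrated with a small Monte-Carlo sketch; the exponential population (clearly skewed), the replication count, and the seed below are arbitrary choices for illustration:

```python
import random
from statistics import mean, pstdev

random.seed(0)  # fixed seed so the sketch is reproducible

def skewness(xs):
    """Standardized third moment: 0 for a symmetric distribution."""
    m, s = mean(xs), pstdev(xs)
    return sum(((x - m) / s) ** 3 for x in xs) / len(xs)

def sampling_distribution(n, reps=5000):
    """Means of `reps` samples of size n drawn from a clearly skewed
    (exponential) population."""
    return [mean(random.expovariate(1.0) for _ in range(n)) for _ in range(reps)]

for n in (2, 5, 30):
    print(n, round(skewness(sampling_distribution(n)), 2))
# The skewness shrinks toward 0 as n grows: the sampling distribution
# of the mean approaches the normal (symmetric, bell-shaped) form.
```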
Although many of the statements made in the preceding paragraphs can be proven mathematically, some of them do not have theoretical proof and can be demonstrated only empirically, via so-called Monte-Carlo experiments. In these experiments, large numbers of samples are generated by a computer following predesigned specifications, and the results from such samples are analyzed using a variety of tests. This way we can empirically evaluate the type and magnitude of errors or biases to which we are exposed when certain theoretical assumptions of the tests we are using are not met by our data. Specifically, Monte-Carlo studies were used extensively with normal distribution-based tests to determine how sensitive they are to violations of the assumption of normal distribution of the analyzed variables in the population. The general conclusion from these studies is that the consequences of such violations are less severe than previously thought. Although these conclusions should not entirely discourage anyone from being concerned about the normality assumption, they have increased the overall popularity of the distribution-dependent statistical tests in all areas of research.
Entries in the Statistics Glossary are taken from the Electronic Manual of STATISTICA and may contain elements that refer to specific features of the STATISTICA system.
Abrupt Permanent Impact. In Time Series, a permanent abrupt impact pattern implies that the overall mean of the time series shifted after the intervention; the overall shift is denoted by ω (omega).
Abrupt Temporary Impact. In Time Series, the abrupt temporary impact pattern implies an initial abrupt increase or decrease due to the intervention that then slowly decays without permanently changing the mean of the series. This type of intervention can be summarized by the expressions:
Prior to intervention: Impact_{t} = 0
At time of intervention: Impact_{t} = ω
After intervention: Impact_{t} = δ*Impact_{t-1}
Note that this impact pattern is again defined by the two parameters δ (delta) and ω (omega). As long as the δ parameter is greater than 0 and less than 1 (the bounds of system stability), the initial abrupt impact will gradually decay. If δ is near 0 (zero), the decay will be very quick, and the impact will have entirely disappeared after only a few observations. If δ is close to 1, the decay will be slow, and the intervention will affect the series over many observations. Note that, when evaluating a fitted model, it is again important that both parameters are statistically significant; otherwise, we could reach paradoxical conclusions. For example, suppose the ω parameter is not significantly different from 0 (zero) but the δ parameter is; this would mean that an intervention did not cause an initial abrupt change, which then showed significant decay.
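The decay pattern can be sketched numerically; the parameter values (ω = 10, δ = 0.5) and the intervention time below are hypothetical:

```python
# Abrupt temporary impact: zero before the intervention, an abrupt
# jump of size omega at the intervention, then geometric decay by delta.
omega, delta = 10.0, 0.5   # hypothetical values; 0 < delta < 1 for stability
intervention_time = 3

impacts = []
for t in range(10):
    if t < intervention_time:
        impacts.append(0.0)                   # prior to intervention
    elif t == intervention_time:
        impacts.append(omega)                 # abrupt initial impact
    else:
        impacts.append(delta * impacts[-1])   # gradual decay

print(impacts)
```

With δ closer to 1 the tail of the series would decay far more slowly; with δ near 0 the impact vanishes after a couple of observations, exactly as described above.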
Accept-Support (AS) Testing. In this type of statistical test, the statistical null hypothesis is the hypothesis that, if true, supports the experimenter's theoretical hypothesis. Consequently, in AS testing, the experimenter would prefer not to obtain "statistical significance." In AS testing, accepting the null hypothesis supports the experimenter's theoretical hypothesis. For more information, see Power Analysis.
Activation Function (in Neural Networks). A function used to transform the activation level of a unit (neuron) into an output signal. Typically, activation functions have a "squashing" effect. Together with the PSP function (which is applied first), this defines the unit type. Neural Networks supports a wide range of activation functions. Only a few of these are used by default; the others are available for customization.
Identity. The activation level is passed on directly as the output. Used in a variety of network types, including linear networks and the output layer of radial basis function networks.
Logistic. This is an S-shaped (sigmoid) curve with output in the range (0,1).
Hyperbolic. The hyperbolic tangent function (tanh): a sigmoid curve, like the logistic function, except that output lies in the range (-1,+1). Often performs better than the logistic function because of its symmetry. Ideal for customization of multilayer perceptrons, particularly the hidden layers.
Exponential. The negative exponential function. Ideal for use with radial units. The combination of radial synaptic function and negative exponential activation function produces units that model a Gaussian (bell-shaped) function centered at the weight vector. The standard deviation of the Gaussian is given by the formula below, where d is the "deviation" of the unit stored in the unit's threshold:
Softmax. Exponential function, with results normalized so that the sum of activations across the layer is 1.0. Can be used in the output layer of multilayer perceptrons for classification problems, so that the outputs can be interpreted as probabilities of class membership (Bishop, 1995; Bridle, 1990).
Unit sum. Normalizes the outputs to sum to 1.0. Used in PNNs to allow the outputs to be interpreted as probabilities.
Square root. Used to transform the squared distance activation in an SOFM network or Cluster network to the actual distance as an output.
Sine. Possibly useful if recognizing radially-distributed data; not used by default.
Ramp. A piece-wise linear version of the sigmoid function. Relatively poor training performance, but fast execution.
Step. Outputs either 1.0 or 0.0, depending on whether the Synaptic value is positive or negative. Can be used to model simple networks such as perceptrons.
The mathematical definitions of the activation functions are shown in the table below:

Activation Functions

| Function    | Definition                              | Range       |
|-------------|-----------------------------------------|-------------|
| Identity    | x                                       | (-inf,+inf) |
| Logistic    | 1/(1+e^{-x})                            | (0,+1)      |
| Hyperbolic  | (e^{x}-e^{-x})/(e^{x}+e^{-x})           | (-1,+1)     |
| Exponential | e^{-x}                                  | (0,+inf)    |
| Softmax     | e^{x_i}/Σ_{j}e^{x_j}                    | (0,+1)      |
| Unit sum    | x_i/Σ_{j}x_j                            | (0,+1)      |
| Square root | sqrt(x)                                 | (0,+inf)    |
| Sine        | sin(x)                                  | [0,+1]      |
| Ramp        | -1 if x<=-1; x if -1<x<+1; +1 if x>=+1  | [-1,+1]     |
| Step        | 0 if x<0; +1 if x>=0                    | [0,+1]      |
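As a rough illustration, several of the functions in the table can be implemented directly. This is a minimal scalar sketch in Python; vectorization and numerical-stability safeguards (such as clipping large negative arguments) are omitted:

```python
import math

def identity(x):
    return x

def logistic(x):
    # S-shaped (sigmoid) curve with output in (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

def hyperbolic(x):
    # tanh: sigmoid curve with output in (-1, +1)
    return math.tanh(x)

def neg_exponential(x):
    # negative exponential, for use with radial units
    return math.exp(-x)

def softmax(xs):
    # exponentials normalized so activations across the layer sum to 1.0
    es = [math.exp(x) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def ramp(x):
    # piecewise-linear version of the sigmoid
    return max(-1.0, min(1.0, x))

def step(x):
    # 1.0 or 0.0 depending on the sign of the synaptic value
    return 1.0 if x >= 0 else 0.0
```

For example, softmax([1.0, 2.0, 3.0]) returns three values that sum to 1.0.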
Additive Models. Additive models represent a generalization of Multiple Regression (which is a special case of general linear models). Specifically, in linear regression, a linear least-squares fit is computed for a set of predictor or X variables, to predict a dependent Y variable. The well known linear regression equation with m predictors, to predict a dependent variable Y, can be stated as:
Y = b_{0} + b_{1}*X_{1} + ... + b_{m}*X_{m}
Where Y stands for the (predicted values of the) dependent variable, X_{1} through X_{m} represent the m values for the predictor variables, and b_{0}, and b_{1} through b_{m} are the regression coefficients estimated by multiple regression.
A generalization of the multiple regression model would be to maintain the additive nature of the model, but to replace the simple terms of the linear equation b_{i}*X_{i} with f_{i}(X_{i}), where f_{i} is a nonparametric function of the predictor X_{i}. In other words, instead of a single coefficient for each variable (additive term) in the model, in additive models an unspecified (non-parametric) function is estimated for each predictor, to achieve the best prediction of the dependent variable values. For additional information, see Hastie and Tibshirani, 1990, or Schimek, 2000.
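The additive structure can be illustrated with a minimal backfitting sketch in Python. The bin-average smoother below is a crude stand-in for the scatterplot smoothers (e.g., splines, loess) actually used to estimate the f_{i}; the function names and the number of bins are arbitrary illustrative choices:

```python
import numpy as np

def bin_smoother(x, r, bins=10):
    # crude nonparametric smoother: average the partial residuals r
    # within quantile bins of x (a stand-in for splines or loess)
    edges = np.quantile(x, np.linspace(0, 1, bins + 1))
    idx = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, bins - 1)
    means = np.array([r[idx == b].mean() if np.any(idx == b) else 0.0
                      for b in range(bins)])
    return means[idx]

def backfit(X, y, iters=20):
    # estimate y ~ b0 + f1(X1) + ... + fm(Xm) by backfitting:
    # cycle over predictors, smoothing the partial residuals of each
    n, m = X.shape
    b0 = y.mean()
    f = np.zeros((m, n))
    for _ in range(iters):
        for j in range(m):
            partial = y - b0 - f.sum(axis=0) + f[j]
            f[j] = bin_smoother(X[:, j], partial)
            f[j] -= f[j].mean()          # identifiability constraint
    return b0, f
```

The fitted values are b0 + f.sum(axis=0); each row of f is the estimated contribution f_{i}(X_{i}) of one predictor.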
Additive Season, Damped Trend. In this Time Series model, the simple exponential smoothing forecasts are "enhanced" both by a damped trend component (independently smoothed with its own single trend parameter; this model is an extension of Brown's one-parameter linear model; see Gardner, 1985, pp. 12-13) and an additive seasonal component (smoothed with its own seasonal parameter). For example, suppose we wanted to forecast from month to month the number of households that purchase a particular consumer electronics device (e.g., VCR). Every year, the number of households that purchase a VCR will increase; however, this trend will be damped (i.e., the upward trend will slowly disappear) over time as the market becomes saturated. In addition, there will be a seasonal component, reflecting the seasonal changes in consumer demand for VCRs from month to month (demand will likely be smaller in the summer and greater during the December holidays). This seasonal component may be additive; for example, a relatively stable number of additional households may purchase VCRs during the December holiday season. To compute the smoothed values for the first season, initial values for the seasonal components are necessary. Also, to compute the smoothed value (forecast) for the first observation in the series, both estimates of S_{0} and T_{0} (initial trend) are necessary. By default, these values are computed as:
T_{0} = (1/φ)*(M_{k}-M_{1})/[(k-1)*p]
where
φ is the trend smoothing parameter
k is the number of complete seasonal cycles
M_{k} is the mean for the last seasonal cycle
M_{1} is the mean for the first seasonal cycle
p is the length of the seasonal cycle
and S_{0} = M_{1} - p*T_{0}/2
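A minimal sketch of these default computations in Python; phi is an assumed name for the trend smoothing parameter, and a monthly series (p = 12) is used in the example:

```python
# Compute the default initial trend T0 and initial smoothed value S0
# from the cycle means, per the formulas above.

def initial_values(series, p, phi):
    k = len(series) // p                      # number of complete cycles
    cycles = [series[i * p:(i + 1) * p] for i in range(k)]
    m1 = sum(cycles[0]) / p                   # mean of first seasonal cycle
    mk = sum(cycles[-1]) / p                  # mean of last seasonal cycle
    t0 = (1.0 / phi) * (mk - m1) / ((k - 1) * p)
    s0 = m1 - p * t0 / 2.0
    return t0, s0
```

For a two-year monthly series that rises by one unit per month, the mean of the second cycle exceeds the mean of the first by 12, giving an initial trend of 1.0 when phi is 1.0.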
Additive Season, Exponential Trend. In this Time Series model, the simple exponential smoothing forecasts are "enhanced" both by an exponential trend component (independently smoothed with its own trend parameter) and an additive seasonal component (smoothed with its own seasonal parameter). For example, suppose we wanted to forecast the monthly revenue for a resort area. Every year, revenue may increase by a certain percentage or factor, resulting in an exponential trend in overall revenue. In addition, there could be an additive seasonal component, for example, a particular fixed (and slowly changing) amount of added revenue during the December holidays.
To compute the smoothed values for the first season, initial values for the seasonal components are necessary. Also, to compute the smoothed value (forecast) for the first observation in the series, both estimates of S_{0} and T_{0} (initial trend) are necessary. By default, these values are computed as:
T_{0} = exp((log(M_{2}) - log(M_{1}))/p)
where
M_{2} is the mean for the second seasonal cycle
M_{1} is the mean for the first seasonal cycle
p is the length of the seasonal cycle
and S_{0} = exp(log(M_{1}) - p*log(T_{0})/2)
Additive Season, Linear Trend. In this Time Series model, the simple exponential smoothing forecasts are "enhanced" both by a linear trend component (independently smoothed with its own trend parameter) and an additive seasonal component (smoothed with its own seasonal parameter). For example, suppose we were to predict the monthly budget for snow removal in a community. There may be a trend component (as the community grows, there is a steady upward trend in the cost of snow removal from year to year). At the same time, there is obviously a seasonal component, reflecting the differential likelihood of snow during different months of the year. This seasonal component could be additive, meaning that a particular fixed additional amount of money is necessary during the winter months, or (see below) multiplicative, that is, given the respective budget figure, it may increase by a factor of, for example, 1.4 during particular winter months.
To compute the smoothed values for the first season, initial values for the seasonal components are necessary. Also, to compute the smoothed value (forecast) for the first observation in the series, both estimates of S_{0} and T_{0} (initial trend) are necessary. By default, these values are computed as:
T_{0} = (M_{k}-M_{1})/((k-1)*p)
where
k is the number of complete seasonal cycles
M_{k} is the mean for the last seasonal cycle
M_{1} is the mean for the first seasonal cycle
p is the length of the seasonal cycle
and S_{0} = M_{1} - T_{0}/2
Additive Season, No Trend. This Time Series model is partially equivalent to the simple exponential smoothing model; however, in addition, each forecast is "enhanced" by an additive seasonal component that is smoothed independently (with its own seasonal smoothing parameter). This model would, for example, be adequate when computing forecasts for the monthly expected amount of rain. The amount of rain will be stable from year to year, or change only slowly. At the same time, there will be seasonal changes ("rainy seasons"), which again may change slowly from year to year.
To compute the smoothed values for the first season, initial values for the seasonal components are necessary. The initial smoothed value S_{0} will by default be computed as the mean for all values included in complete seasonal cycles.
Adjusted Means. These are the means we would get after removing all differences that can be accounted for by the covariate in an analysis of variance design (see ANOVA).
The general formula (see Kerlinger & Pedhazur, 1973, p. 272) is
Y-bar_{j(adj)} = Y-bar_{j} - b(X-bar_{j} - X-bar)
where
Y-bar_{j(adj)} is the adjusted mean of group j;
Y-bar_{j} is the mean of group j before adjustment;
b is the common regression coefficient;
X-bar_{j} is the mean of the covariate for group j;
X-bar is the grand mean of the covariate.
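The formula translates directly into code; a minimal sketch:

```python
# Adjusted mean per the Kerlinger & Pedhazur formula:
# Y-bar_j(adj) = Y-bar_j - b*(X-bar_j - X-bar)

def adjusted_mean(group_y_mean, b, group_x_mean, grand_x_mean):
    return group_y_mean - b * (group_x_mean - grand_x_mean)
```

For example, a group with a Y mean of 10, common regression coefficient 2, covariate mean 3, and grand covariate mean 5 has an adjusted mean of 10 - 2*(3 - 5) = 14.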
See also categorical predictor variable, covariates and General Linear Models or ANOVA/MANOVA.
Aggregation. Aggregation is a function that summarizes data from one or more sources. This could be a measure of central tendency (mean, median, mode), variation (standard deviation), range (minimum, maximum), or a total (sum). Aggregation can occur by either collapsing or expanding the data.
AID. AID (Automatic Interaction Detection) is a classification program developed by Morgan & Sonquist (1963) that led to the development of the THAID (Morgan & Messenger, 1973) and CHAID (Kass, 1980) classification tree programs. These programs perform multi-level splits when computing classification trees. For discussion of the differences of AID from other classification tree programs, see A Brief Comparison of Classification Tree Programs.
Akaike Information Criterion (AIC). When a model involving q parameters is fitted to data, the criterion is defined as -2Lq + 2q, where Lq is the maximized log-likelihood. Akaike suggested minimizing the criterion to choose between models with different numbers of parameters. It was originally proposed for time series models, but is also used in regression. The Akaike Information Criterion (AIC) can be used in Generalized Linear/Nonlinear Models (GLZ) when comparing the subsets of effects during best subset regression. Since the evaluation of the score statistic does not require iterative computations, best subset selection based on the score statistic is computationally faster, while selection based on the AIC statistic usually provides more accurate results.
In Structural Equation Modeling, AIC can be computed using the discrepancy function with the formula F_{k} + 2v/(N+1) where F_{k} is the discrepancy function of the model with k parameters, v is the degrees of freedom for the model and N is the sample size.
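Both definitions are simple to compute; a minimal sketch (smaller AIC values indicate the preferred model):

```python
# The two AIC formulas given above.

def aic(log_likelihood, q):
    # -2*Lq + 2*q, where Lq is the maximized log-likelihood
    return -2.0 * log_likelihood + 2.0 * q

def aic_sem(F_k, v, N):
    # Structural Equation Modeling form: F_k + 2*v/(N + 1), where F_k is
    # the discrepancy function, v the degrees of freedom, N the sample size
    return F_k + 2.0 * v / (N + 1)
```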
Algorithm. As opposed to heuristics (which contain general recommendations based on statistical evidence or theoretical reasoning), algorithms are completely defined, finite sets of steps, operations, or procedures that will produce a particular outcome. For example, with a few exceptions, all computer programs, mathematical formulas, and (ideally) medical and food recipes are algorithms. See also, Data Mining, Neural Networks, heuristic.
Anderson-Darling Test. The Anderson-Darling procedure is a general test to compare the fit of an observed cumulative distribution function to an expected cumulative distribution function. This test is applicable to complete data sets (without censored observations). The critical values for the Anderson-Darling statistic have been tabulated (see, for example, Dodson, 1994, Table 4.4) for sample sizes between 10 and 40; this test is not computed for n less than 10 or greater than 40.
The Anderson-Darling test is used in Weibull and Reliability/Failure Time Analysis; see also, Mann-Scheuer-Fertig Test and Hollander-Proschan Test.
Append a Network. A function to allow two neural networks (with compatible output and input layers) to be joined into a single network.
Append Cases and/or Variables. Functions that add new cases (i.e., rows of data) and/or variables (i.e., columns of data) to the end of the data set (the "bottom" or the right hand side, respectively). Cases and Variables can also be inserted in arbitrary locations of the data set.
Application Programming Interface (API). An Application Programming Interface is a set of functions that conform to the conventions of a particular operating system (e.g., Windows) and allow the user to programmatically access the functionality of another program. For example, the kernel of STATISTICA Neural Networks can be accessed by other program packages (e.g., Visual Basic, STATISTICA BASIC, Delphi, C, C++) in a variety of ways.
Arrow. An element in a path diagram used to indicate causal flow from one variable to another, or, in narrower interpretation, to show which of two variables in a linear equation is the independent variable and which is the dependent variable.
Assignable Causes and Actions. In the context of monitoring quality characteristics, we have to distinguish between two different types of variability: Common cause variation describes random variability that is inherent in the process and affects all individual values. Ideally, when your process is in-control, only common cause variation will be present. In a quality control chart, it will show up as a random fluctuation of the individual samples around the center line with all samples falling between the upper and lower control limit and no non-random patterns (runs) of adjacent samples. Special cause or assignable cause variation is due to specific circumstances that can be accounted for. It will usually show up in the QC chart as outlier samples (i.e., exceeding the lower or upper control limit) or as a systematic pattern (run) of adjacent samples. It will also affect the calculation of the chart specifications (center line and control limits).
With some software programs, if we investigate the out-of-control conditions and find an explanation for them, we can assign descriptive labels to those out-of-control samples and explain the causes (e.g., valve defect) and actions that have been taken (e.g., valve fixed). Having causes and actions displayed in the chart will document that the center line and the control limits of the chart are affected by special cause variation in the process.
Association Rules. Data mining for association rules is often the first and most useful method for analyzing data that describe transactions, lists of items, unique phrases (in text mining), etc. In general, association rules take the form If Body then Head, where Body and Head stand for simple codes, text values, items, consumer choices, phrases, etc., or the conjunction of codes and text values, etc. (e.g., if (Car=Porsche and Age<20 and ThrillSeeking=High) then (Risk=High and Insurance=High); here the logical conjunction before the "then" would be the Body, and the logical conjunction following the "then" would be the Head of the association rule). The a-priori algorithm (see Agrawal and Swami, 1993; Agrawal and Srikant, 1994; Han and Lakshmanan, 2001; see also Witten and Frank, 2000) is a popular and efficient algorithm for deriving such association rules from large data sets, based on some user-defined "threshold" values for the rules, such as minimum support and confidence.
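The basic quantities behind such rules, support and confidence, can be computed directly from a list of transactions. A minimal sketch (the a-priori algorithm itself, which searches the space of candidate rules efficiently, is not shown):

```python
# Support and confidence of a rule "If Body then Head" over transactions.

def support_confidence(transactions, body, head):
    body, head = set(body), set(head)
    n_body = sum(1 for t in transactions if body <= set(t))
    n_both = sum(1 for t in transactions if (body | head) <= set(t))
    support = n_both / len(transactions)               # P(Body and Head)
    confidence = n_both / n_body if n_body else 0.0    # P(Head | Body)
    return support, confidence
```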
Asymmetrical Distribution. If you split the distribution in half at its mean (or median), the distribution of values on the two sides of this central point would not be the same (i.e., not symmetrical) and the distribution would be considered "skewed." See also, Descriptive Statistics Overview.
Attribute (Attribute Variable). An alternative name for a nominal variable.
Augmented Product Moment Matrix. For a set of p variables, this is a (p + 1) x (p + 1) square matrix. The first p rows and columns contain the matrix of moments about zero, while the last row and column contain the sample means for the p variables. The matrix is therefore of the form:

A = [ M   m ]
    [ m'  1 ]

where M is the p x p matrix of moments about zero, with elements m_{ij} = (1/N)*Σ x_{ki}*x_{kj} (summing over the N cases k), and m is the vector with the means of the p variables (see Structural Equation Modeling).
Autoassociative Network. A neural network (usually a multilayer perceptron) designed to reproduce its inputs at its outputs, while "squeezing" the data through a lower-dimensionality middle layer. Used for compression or dimensionality reduction purposes (see Fausett, 1994; Bishop, 1995).
Automatic Network Designer. A heuristic algorithm (implemented in STATISTICA Neural Networks) that experimentally determines an appropriate network architecture to fit a specified data set.
B Coefficients. A line in a two dimensional or two-variable space is defined by the equation Y=a+b*X; in full text: the Y variable can be expressed in terms of a constant (a) and a slope (b) times the X variable. The constant is also referred to as the intercept, and the slope as the regression coefficient or B coefficient. In general then, multiple regression procedures will estimate a linear equation of the form:
Y = a + b_{1}*X_{1} + b_{2}*X_{2} + ... +b_{p}*X_{p}
Note that in this equation, the regression coefficients (or B coefficients) represent the independent contributions of each independent variable to the prediction of the dependent variable. However, their values may not be comparable between variables because they depend on the units of measurement or ranges of the respective variables. Some software products will produce both the raw regression coefficients (B coefficients) and the Beta coefficients (note that the Beta coefficients are comparable across variables). See also, Multiple Regression.
Back Propagation (in Neural Networks). Back propagation is the best known training algorithm for neural networks and still one of the most useful. Devised independently by Rumelhart et al. (1986), Werbos (1974), and Parker (1985), it is thoroughly described in most neural network textbooks (e.g., Patterson, 1996; Fausett, 1994; Haykin, 1994). It has lower memory requirements than most algorithms, and usually reaches an acceptable error level quite quickly, although it can then be very slow to converge properly on an error minimum. It can be used on most types of networks, although it is most appropriate for training multilayer perceptrons.
Back propagation includes:
Time-dependent learning rate
Time-dependent momentum rate
Random shuffling of order of presentation
Additive noise during training
Independent testing on a selection set
A variety of stopping conditions
RMS error plotting
Selectable error function
The last five bulleted items are equally available in other iterative algorithms, including conjugate gradient descent, Quasi-Newton, Levenberg-Marquardt, quick propagation, Delta-bar-Delta, and Kohonen training (apart from noise in conjugate gradients, Kohonen and Levenberg-Marquardt, and selectable error function in Levenberg-Marquardt).
Technical Details. The on-line version of back propagation calculates the local gradient of each weight with respect to each case during training. Weights are updated once per training case.
The update formula is:

Δw_{ij} = η*δ_{j}*o_{i} + α*Δw_{ij}(previous)

where

η is the learning rate
δ_{j} is the local error gradient
α is the momentum coefficient
o_{i} is the output of the i'th unit

Thresholds are treated as weights with o_{i} = -1.
The local error gradient calculation depends on whether the unit into which the weights feed is in the output layer or the hidden layers.
Local gradients in output layers are the product of the derivatives of the network's error function and the units' activation functions.
Local gradients in hidden layers are the weighted sum of the unit's outgoing weights and the local gradients of the units to which these weights connect.
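A minimal on-line back propagation sketch in Python, following the update rule and the two local-gradient rules above. The XOR task, network size, and learning parameters are arbitrary illustrative choices, and logistic activations with a sum-of-squares error are assumed:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_xor(eta=0.5, alpha=0.9, epochs=5000, seed=0):
    rng = np.random.default_rng(seed)
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], float)
    y = np.array([0.0, 1.0, 1.0, 0.0])
    W1 = rng.normal(0, 1, (2, 4)); b1 = np.zeros(4)   # 2-4-1 perceptron
    W2 = rng.normal(0, 1, 4);      b2 = 0.0
    dW1 = np.zeros_like(W1); dW2 = np.zeros_like(W2)  # momentum terms
    for _ in range(epochs):
        for x, t in zip(X, y):            # weights updated once per case
            h = sigmoid(x @ W1 + b1)
            o = sigmoid(h @ W2 + b2)
            # output layer: error derivative * activation derivative
            d_out = (o - t) * o * (1 - o)
            # hidden layer: outgoing weights weighted by output gradient
            d_hid = (W2 * d_out) * h * (1 - h)
            dW2 = -eta * d_out * h + alpha * dW2           # dw = eta*d*o + a*dw_prev
            dW1 = -eta * np.outer(x, d_hid) + alpha * dW1
            W2 += dW2; b2 += -eta * d_out
            W1 += dW1; b1 += -eta * d_hid
    return sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2)
```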
Bagging (Voting, Averaging). The concept of bagging (voting for classification, averaging for regression-type problems with continuous dependent variables of interest) applies to the area of predictive data mining. It is used to combine the predicted classifications (or predictions) from multiple models, or from the same type of model for different learning data, and to address the inherent instability of results when applying complex models to relatively small data sets.
Suppose your data mining task is to build a model for predictive classification, and the dataset from which to train the model (the learning data set, which contains observed classifications) is relatively small. You could repeatedly sub-sample (with replacement) from the dataset, and apply a tree classifier (e.g., C&RT or CHAID) to the successive samples. In practice, very different trees will often be grown for the different samples, illustrating the instability of models often evident with small datasets. One method of deriving a single prediction (for new observations) is to use all trees found in the different samples, and to apply some simple voting: The final classification is the one most often predicted by the different trees. Note that some weighted combination of predictions (weighted vote, weighted average) is also possible, and commonly used. A sophisticated (machine learning) algorithm for generating weights for weighted prediction or voting is the Boosting procedure.
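A minimal sketch of bagging by voting; the trivial nearest-class-mean "model" below is only a stand-in for a real base learner such as a tree classifier:

```python
import random
from collections import Counter

def fit_stub(sample):
    # toy base learner: map each class to its mean feature value
    sums, counts = {}, {}
    for x, c in sample:
        sums[c] = sums.get(c, 0.0) + x
        counts[c] = counts.get(c, 0) + 1
    return {c: sums[c] / counts[c] for c in sums}

def predict_stub(model, x):
    # classify x as the class with the nearest mean
    return min(model, key=lambda c: abs(model[c] - x))

def bagged_predict(data, x, n_models=25, seed=0):
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(n_models):
        sample = [rng.choice(data) for _ in data]   # bootstrap resample
        votes[predict_stub(fit_stub(sample), x)] += 1
    return votes.most_common(1)[0][0]               # simple majority vote
```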
Balanced ANOVA Design. An ANOVA design is a balanced ANOVA design when all cells in the ANOVA design have equal N, when there are no missing cells in the design, and, if nesting is present, when the nesting is balanced so that equal numbers of levels of the factors that are nested appear in the levels of the factor(s) that they are nested in. Most between-groups ANOVA designs can be analyzed much more efficiently when they are balanced ANOVA designs.
Bar/Column Plots, 2D. The Bar/Column Plot represents sequences of values as bars or columns (one case is represented by one bar/column). If more than one variable is selected, each plot can be represented in a separate graph or all of them can be combined in one display as multivariate clusters of bars/columns (one cluster per case, see example below).
Bar Dev Plot. The "bar deviation" plot is similar to the Bar X plot, in that individual data points are represented by vertical bars, however, the bars connect the data points to a user-selectable baseline. If the baseline value is different than the plot's Y-axis minimum, then individual bars will extend either up or down, depending on the direction of the "deviation" of individual data points from the baseline.
Bar Left Y Plot. In this plot, one horizontal bar is drawn for each data point (i.e., each pair of XY coordinates, see example below), connecting the data point and the left Y-axis. The vertical position of the bar is determined by the data point's Y value, and its length by the respective X value.
Bar Right Y Plot. In this plot, one horizontal bar is drawn for each data point (i.e., each pair of XY coordinates), connecting the data point and the right Y-axis. The vertical position of the bar is determined by the data point's Y value, and its length by the respective X value.
Bar Top Plot. (Also known as "hanging" column plots.) In this plot, one vertical bar is drawn for each data point (i.e., each pair of XY coordinates), connecting the data point and the upper X-axis. The horizontal position of the bar is determined by the data point's X value, and its length by the respective Y value.
Bar X Plot. In this plot, one vertical bar is drawn for each data point (i.e., each pair of XY coordinates), connecting the data point and the lower X-axis.
The horizontal position of the bar is determined by the data point's X value, and its height by the respective Y value.
Bartlett Window. In Time Series, the Bartlett window is a weighted moving average transformation used to smooth the periodogram values. In the Bartlett window (Bartlett, 1950) the weights are computed as:
w_{j} = 1-(j/p) (for j = 0 to p)
w_{-j} = w_{j} (for j ≠ 0)
where p = (m-1)/2
This weight function will assign the greatest weight to the observation being smoothed in the center of the window, and increasingly smaller weights to values that are further away from the center. See also, Basic Notations and Principles.
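A minimal sketch of Bartlett-window smoothing; normalizing the weights to sum to 1 is an assumption made here so that the weighted moving average preserves the overall scale:

```python
def bartlett_weights(m):
    # w_j = 1 - |j|/p for j = -p..p, with p = (m-1)/2; m is the odd
    # window width. Weights are normalized to sum to 1 (assumption).
    p = (m - 1) // 2
    w = [1.0 - abs(j) / p for j in range(-p, p + 1)]
    s = sum(w)
    return [x / s for x in w]

def smooth(values, m):
    # weighted moving average over the interior points of the series
    w = bartlett_weights(m)
    p = (m - 1) // 2
    return [sum(w[j + p] * values[i + j] for j in range(-p, p + 1))
            for i in range(p, len(values) - p)]
```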
Basis Functions. Basis functions of predictor variables (X) play an important role in the estimation of Multivariate Adaptive Regression Splines (MARSplines). Specifically, MARSplines uses two-sided truncated functions of the form (x-t)+ and (t-x)+ as basis functions for a linear or nonlinear expansion that approximates the relationships between the response and predictor variables.
Here (x-t)+ equals x-t when x > t and 0 otherwise; (t-x)+ is defined analogously. Parameter t is the knot of the basis functions (defining the "pieces" of the piecewise linear regression); these knots are also determined from the data.
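A minimal sketch of the two basis functions:

```python
# Two-sided truncated ("hinge") basis functions with knot t.

def hinge_pos(x, t):
    # (x - t)+ : equals x - t for x > t, else 0
    return max(x - t, 0.0)

def hinge_neg(x, t):
    # (t - x)+ : equals t - x for x < t, else 0
    return max(t - x, 0.0)
```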
Batch Algorithms in STATISTICA Neural Networks. Algorithms that calculate the average gradient over an epoch, rather than adjusting on a case-by-case basis during training. Quick propagation, Delta-Bar-Delta, conjugate gradient descent and Levenberg-Marquardt are all batch algorithms.
Bayesian Information Criterion (BIC). When a model involving q parameters is fitted to data with n observations, the Bayesian Information criterion is defined as - 2 L_{q} + q * ln(n), where L_{q} is the maximized log-likelihood. This goodness of fit statistic adjusts for the number of parameters estimated as well as for the given amount of data. This is closely related to AIC. In STATISTICA, the BIC can be used in the Generalized Linear/Nonlinear Models (GLZ) module to evaluate the fit of a model.
Bayesian Networks. Networks based on Bayes' theorem, on the inference of probability distributions from data sets. See also, probabilistic and generalized regression neural networks.
Bayesian Statistics (Analysis). Bayesian analysis is an approach to statistical analysis that is based on Bayes' law, which states that the posterior probability of a parameter p is proportional to the prior probability of parameter p multiplied by the likelihood of p derived from the data collected. This increasingly popular methodology represents an alternative to the traditional (or frequentist probability) approach: whereas the latter attempts to establish confidence intervals around parameters, and/or falsify a-priori null-hypotheses, the Bayesian approach attempts to keep track of how a-priori expectations about some phenomenon of interest can be refined, and how observed data can be integrated with such a-priori beliefs, to arrive at updated posterior expectations about the phenomenon.
A good metaphor (and actual application) for the Bayesian approach is that of a physician who applies consecutive examinations to a patient so as to refine the certainty of a particular diagnosis: The results of each individual examination or test should be combined with the a-priori knowledge about the patient, and expectation that the respective diagnosis is correct. The goal is to arrive at a final diagnosis which the physician believes to be correct with a known degree of certainty.
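The diagnostic metaphor can be made concrete with a small sketch of Bayes' law for a single test result; the posterior after one test then becomes the prior for the next. The numbers used are purely illustrative:

```python
# Posterior probability of disease D given a test result T:
# P(D|T) = P(T|D)*P(D) / [P(T|D)*P(D) + P(T|~D)*P(~D)]

def bayes_update(prior, sensitivity, specificity, positive=True):
    if positive:
        p_t_d, p_t_nd = sensitivity, 1.0 - specificity
    else:
        p_t_d, p_t_nd = 1.0 - sensitivity, specificity
    num = p_t_d * prior
    return num / (num + p_t_nd * (1.0 - prior))
```

For example, with a prior of 0.1 and a test with 90% sensitivity and 90% specificity, one positive result raises the posterior to 0.5; feeding that posterior into a second positive test raises it further.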
Bernoulli Distribution. The Bernoulli distribution best describes all situations where a "trial" is made resulting in either "success" or "failure," such as when tossing a coin, or when modeling the success or failure of a surgical procedure. The Bernoulli distribution is defined as:
f(x) = p^{x} * (1-p)^{1-x}
for x ∈ {0,1}
where p is the probability that a particular event (e.g., success) will occur.
For a complete listing of all distribution functions, see Distributions and Their Functions.
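A direct implementation of the Bernoulli probability function:

```python
# f(x) = p**x * (1-p)**(1-x) for x in {0, 1}

def bernoulli_pmf(x, p):
    assert x in (0, 1)
    return p ** x * (1 - p) ** (1 - x)
```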
Best Network Retention. A facility (implemented in STATISTICA Neural Networks) to automatically store the best neural network discovered during training, for later restoration at the end of a set of experiments. See also, Neural Networks.
Best Subset Regression. A model-building technique which finds subsets of predictor variables that best predict responses on a dependent variable by linear (or nonlinear) regression.
For an overview of best subset regression see General Regression Models; for nonlinear stepwise and best subset regression, see Generalized Linear Models.
Beta Coefficients. The Beta coefficients are the regression coefficients you would have obtained had you first standardized all of your variables to a mean of 0 and a standard deviation of 1. Thus, the advantage of Beta coefficients (as compared to B coefficients, which are not standardized) is that the magnitude of these Beta coefficients allows you to compare the relative contribution of each independent variable in the prediction of the dependent variable. See also, Multiple Regression.
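A minimal sketch of the relationship between B and Beta coefficients: the Beta coefficient equals the raw B coefficient rescaled by the ratio of the predictor's and the dependent variable's standard deviations (equivalently, the coefficients obtained after z-standardizing all variables):

```python
import numpy as np

def beta_coefficients(X, y):
    # raw B coefficients from least squares with an intercept column
    Xd = np.column_stack([np.ones(len(y)), X])
    b = np.linalg.lstsq(Xd, y, rcond=None)[0][1:]
    # rescale: Beta_j = B_j * s_xj / s_y
    return b * X.std(axis=0, ddof=1) / y.std(ddof=1)
```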
Beta Distribution. The beta distribution (the term first used by Gini, 1911) is defined as:

f(x) = [Γ(ν+ω)/(Γ(ν)*Γ(ω))] * x^{ν-1} * (1-x)^{ω-1}

0 ≤ x ≤ 1
ν > 0, ω > 0

where

Γ (gamma) is the Gamma function

ν, ω are the shape parameters
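A minimal sketch of the density, using math.gamma for the Gamma function (the parameter names nu and omega follow the definition above):

```python
import math

def beta_pdf(x, nu, omega):
    # f(x) = Gamma(nu+omega)/(Gamma(nu)*Gamma(omega)) * x**(nu-1) * (1-x)**(omega-1)
    if not (0.0 <= x <= 1.0):
        return 0.0
    c = math.gamma(nu + omega) / (math.gamma(nu) * math.gamma(omega))
    return c * x ** (nu - 1.0) * (1.0 - x) ** (omega - 1.0)
```

With nu = omega = 1 the density reduces to the uniform distribution on [0, 1].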
Big Data. Typically, the discussion of big data in the context of predictive modeling and data mining pertains to data repositories (and the analyses based on such repositories) that are larger than a few terabytes (1 terabyte = 1,000 gigabytes; 1 gigabyte = 1,000 megabytes). Some data repositories may grow to thousands of terabytes, i.e., to the petabyte range (1,000 terabytes = 1 petabyte). Beyond petabytes, data storage can be measured in exabytes; for example, the manufacturing sector worldwide in 2010 is estimated to have stored a total of 2 exabytes of new information (Manyika et al., 2011).
Bimodal Distribution. A distribution that has two modes (thus two "peaks").
Bimodality of the distribution in a sample is often a strong indication that the distribution of the variable in the population is not normal. Bimodality of the distribution may provide important information about the nature of the investigated variable (i.e., the measured quality). For example, if the variable represents a reported preference or attitude, then bimodality may indicate a polarization of opinions. Often, however, bimodality may indicate that the sample is not homogenous and the observations come in fact from two or more "overlapping" distributions. Sometimes, bimodality of the distribution may indicate problems with the measurement instrument (e.g., "gage calibration problems" in natural sciences, or "response biases" in social sciences). See also unimodal distribution, multimodal distribution.
Binomial Distribution. The binomial distribution (the term first used by Yule, 1911) is defined as:
f(x) = [n!/(x!*(n-x)!)] * p^{x} * q^{n-x}
for x = 0, 1, 2, ..., n
where
p is the probability of success at each trial
q is equal to 1-p
n is the number of independent trials
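A direct implementation of the definition, using math.comb for the binomial coefficient n!/(x!*(n-x)!):

```python
import math

def binomial_pmf(x, n, p):
    # f(x) = C(n, x) * p**x * (1-p)**(n-x), for x = 0, 1, ..., n
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)
```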
Bivariate Normal Distribution. Two variables follow the bivariate normal distribution if for each value of one variable, the corresponding values of another variable are normally distributed. The bivariate normal probability distribution function for a pair of continuous random variables (X and Y) is given by:
f(x,y) = {1/[2πσ_{1}σ_{2}(1-ρ^{2})^{1/2}]} * exp{[-1/(2(1-ρ^{2}))] * [((x-μ_{1})/σ_{1})^{2} - 2ρ((x-μ_{1})/σ_{1})((y-μ_{2})/σ_{2}) + ((y-μ_{2})/σ_{2})^{2}]}

for -∞ < x < ∞, -∞ < y < ∞, -∞ < μ_{1} < ∞, -∞ < μ_{2} < ∞, σ_{1} > 0, σ_{2} > 0, and -1 < ρ < 1

where

μ_{1}, μ_{2} are the respective means of the random variables X and Y
σ_{1}, σ_{2} are the respective standard deviations of the random variables X and Y
ρ is the correlation coefficient of X and Y
e is the base of the natural logarithm, sometimes called Euler's e (2.71...)
π is the constant Pi (3.14...)
See also, Normal Distribution, Elementary Concepts (Normal Distribution).
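A direct implementation of the density; a minimal sketch:

```python
import math

def bivariate_normal_pdf(x, y, mu1, mu2, s1, s2, rho):
    # standardized deviations from the two means
    zx = (x - mu1) / s1
    zy = (y - mu2) / s2
    norm = 1.0 / (2.0 * math.pi * s1 * s2 * math.sqrt(1.0 - rho ** 2))
    # quadratic form in the exponent
    q = (zx ** 2 - 2.0 * rho * zx * zy + zy ** 2) / (2.0 * (1.0 - rho ** 2))
    return norm * math.exp(-q)
```

At the means, with unit standard deviations and zero correlation, the density equals 1/(2π).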
Blocking (in Experimental Design). In some experiments, observations are organized in natural "chunks" or blocks. You want to make sure that these blocks do not bias your estimates of main effects or interactions. For example, consider an experiment to improve the quality of special ceramics, produced in a kiln. The size of the kiln is limited so that you cannot produce all runs (observations) of your experiment at once. In that case you need to break up the experiment into blocks. However, you do not want to run positive factor settings (for all factors in your experiment) in one block, and all negative settings in the other. Otherwise, any incidental differences between blocks would systematically affect all estimates of the main effects and interactions of the factors of interest. Rather, you want to distribute the runs over the blocks so that any differences between blocks (i.e., the blocking factor) do not bias your results for the factor effects of interest. This is accomplished by treating the blocking factor as another factor in the design. Blocked designs often also have the advantage of being statistically more powerful, because they allow you to estimate and control the variability in the production process that is due to differences between blocks.
For a detailed discussion of various blocked designs, and for examples of how to analyze such designs, see Experimental Design and General Linear Models.
Bonferroni Adjustment. When performing multiple statistical significance tests on the same data, the Bonferroni adjustment can be applied to make it more "difficult" for any one test to be statistically significant. For example, when reviewing multiple correlation coefficients from a correlation matrix, accepting and interpreting the correlations that are statistically significant at the conventional .05 level may be inappropriate, given that multiple tests are performed. Specifically, the alpha error probability of erroneously accepting the observed correlation coefficient as not-equal-to-zero when in fact (in the population) it is equal to zero may be much larger than .05 in this case.
The Bonferroni adjustment usually is accomplished by dividing the alpha level (usually set to .05, .01, etc.) by the number of tests being performed. For instance, suppose you performed five tests of individual correlations from the same correlation matrix. The Bonferroni adjusted level of significance for any one correlation would be:
.05 / 5 = .01
Any test that results in a p-value of less than .01 would be considered statistically significant; correlations with a probability value greater than .01 (including those with p-values between .01 and .05) would be considered non-significant.
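The adjustment takes only a few lines of code; the p-values below are hypothetical results from 5 individual correlation tests.

```python
# Bonferroni adjustment: divide the family-wise alpha by the number of tests.
alpha = 0.05
p_values = [0.003, 0.020, 0.045, 0.260, 0.008]   # hypothetical test results

adjusted_alpha = alpha / len(p_values)           # .05 / 5 = .01

# Only tests with p below the adjusted level remain significant; note that
# the tests with p between .01 and .05 are now considered non-significant.
significant = [p < adjusted_alpha for p in p_values]
print(significant)   # [True, False, False, False, True]
```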
Bonferroni Test. This post hoc test can be used to determine the significant differences between group means in an analysis of variance setting. The Bonferroni test is very conservative when a large number of group means are being compared (for a detailed discussion of different post hoc tests, see Winer, Michels, and Brown, 1991). For more details, see General Linear Models. See also, Post Hoc Comparisons. For a discussion of statistical significance, see Elementary Concepts.
Boosting. The concept of boosting applies to the area of predictive data mining, to generate multiple models or classifiers (for prediction or classification), and to derive weights to combine the predictions from those models into a single prediction or predicted classification (see also Bagging).
A simple algorithm for boosting works like this: Start by applying some method (e.g., a tree classifier such as C&RT or CHAID) to the learning data, where each observation is assigned an equal weight. Compute the predicted classifications, and apply weights to the observations in the learning sample that are inversely proportional to the accuracy of the classification. In other words, assign greater weight to those observations that were difficult to classify (where the misclassification rate was high), and lower weights to those that were easy to classify (where the misclassification rate was low). In the context of C&RT for example, different misclassification costs (for the different classes) can be applied, inversely proportional to the accuracy of prediction in each class. Then apply the classifier again to the weighted data (or with different misclassification costs), and continue with the next iteration (application of the analysis method for classification to the re-weighted data).
Boosting will generate a sequence of classifiers, where each consecutive classifier in the sequence is an "expert" in classifying observations that were not well classified by those preceding it. During deployment (for prediction or classification of new cases), the predictions from the different classifiers can then be combined (e.g., via voting, or some weighted voting procedure) to derive a single best prediction or classification.
Note that boosting can also be applied to learning methods that do not explicitly support weights or misclassification costs. In that case, random sub-sampling can be applied to the learning data in the successive steps of the iterative boosting procedure, where the probability for selection of an observation into the subsample is inversely proportional to the accuracy of the prediction for that observation in the previous iteration (in the sequence of iterations of the boosting procedure).
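The iterative reweighting scheme described above can be sketched in a few dozen lines. The sketch below uses simple one-feature threshold "stumps" and AdaBoost-style weight updates rather than a C&RT or CHAID classifier, so it is an illustration of the general loop, not of Statistica's own boosting implementation.

```python
# A minimal boosting sketch: fit a weak classifier, up-weight the
# misclassified observations, refit, and combine by weighted voting.
import math

def stump_fit(X, y, w):
    """Find the threshold and sign on a single feature minimizing weighted error."""
    best = None
    for thr in sorted(set(X)):
        for sign in (1, -1):
            pred = [sign if x >= thr else -sign for x in X]
            err = sum(wi for wi, p, yi in zip(w, pred, y) if p != yi)
            if best is None or err < best[0]:
                best = (err, thr, sign)
    return best

def boost(X, y, rounds=5):
    n = len(X)
    w = [1.0 / n] * n                 # start with equal observation weights
    classifiers = []
    for _ in range(rounds):
        err, thr, sign = stump_fit(X, y, w)
        err = max(err, 1e-10)         # guard against log(0) for a perfect fit
        alpha = 0.5 * math.log((1 - err) / err)   # weight of this classifier
        classifiers.append((alpha, thr, sign))
        # Up-weight observations this classifier got wrong, down-weight the rest.
        for i in range(n):
            pred = sign if X[i] >= thr else -sign
            w[i] *= math.exp(-alpha * y[i] * pred)
        total = sum(w)
        w = [wi / total for wi in w]  # renormalize the weights
    return classifiers

def predict(classifiers, x):
    """Combine the sequence of classifiers by weighted voting."""
    score = sum(a * (s if x >= t else -s) for a, t, s in classifiers)
    return 1 if score >= 0 else -1

# Toy data: a single feature with labels -1 / +1.
X = [1, 2, 3, 4, 5, 6]
y = [-1, -1, -1, 1, 1, 1]
model = boost(X, y)
print([predict(model, x) for x in X])   # recovers the labels on this toy data
```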
Boundary Case. A boundary case occurs when a parameter iterates to the "boundary" of the permissible "parameter space" (see Structural Equation Modeling). For example, a variance can only take on values from 0 to infinity. If, during iteration, the program attempts to move an estimate of a variance below zero, the program will constrain it to be on the boundary value of 0.
For some problems (for example a Heywood Case in factor analysis), it may be possible to reduce the discrepancy function by estimating a variance to be a negative number. In that case, the program does "the best it can" within the permissible parameter space, but does not actually obtain the "global minimum" of the discrepancy function.
Box Plot/Medians (Block Stats Graphs). This type of Block Stats Graph will produce a box plot of medians (and min/max values and 25th and 75th percentiles) for the columns or rows of the block. Each box will represent data from one column or row.
Box Plot/Means (Block Stats Graphs). This type of Block Stats Graph will produce a box plot of means (and standard errors and standard deviations) for the columns or rows of the block. Each box will represent data from one column or row.
Box Plots, 2D. In Box Plots (this term was first used by Tukey, 1970), ranges or distribution characteristics of values of a selected variable (or variables) are plotted separately for groups of cases defined by values of a categorical (grouping) variable. The central tendency (e.g., median or mean), and range or variation statistics (e.g., quartiles, standard errors, or standard deviations) are computed for each group of cases and the selected values are presented in the selected box plot style. Outlier data points can also be plotted.
Box Plots, 2D - Box Whiskers. This type of box plot will place a box around the midpoint (i.e., mean or median) which represents a selected range (i.e., standard error, standard deviation, min-max, or constant) and whiskers outside of the box which also represent a selected range (see the example graph, below).
Box Plots, 2D - Boxes. This type of box plot will place a box around the midpoint (i.e., mean or median) which represents the selected range (i.e., standard error, standard deviation, min-max, or constant).
Box Plots, 2D - Whiskers. In this style of box plot, the range (i.e., standard error, standard deviation, min-max, or constant) is represented by "whiskers" (i.e., as a line with a serif on both ends, see graph below).
Box Plots, 3D. In Box Plots (this term was first used by Tukey, 1970), ranges or distribution characteristics of values of selected variables are plotted separately for groups of cases defined by values of a categorical (grouping) variable. The central tendency (e.g., median or mean), and range or variation statistics (e.g., quartiles, standard errors, or standard deviations) are computed for each group of cases and the selected values are presented in the selected box plot style. Outlier data points can also be plotted.
Box Plots 3D - Border-Style Ranges. In this style of 3D Sequential Box Plot, the ranges of values of selected variables are plotted separately for groups of cases defined by values of a categorical (grouping) variable. The central tendency (e.g., median or mean), and range or variation statistics (e.g., quartiles, standard errors, or standard deviations) are computed for each variable and for each group of cases and the selected values are presented as points with "whiskers," and the ranges marked by the "whiskers" are connected with lines (i.e., range borders) separately for each variable.
3D Range plots (see example graph below) differ from 3D Box plots in that for Range plots, the ranges are the values of the selected variables (e.g., one variable contains the minimum range values and another variable contains the maximum range values) while for Box plots, the ranges are calculated from variable values (e.g., standard deviations, standard errors, or min-max value).
Box Plots 3D - Double Ribbon Ranges. In this style of 3D Sequential Box Plot, the ranges of values of selected variables are plotted separately for groups of cases defined by values of a categorical (grouping) variable. The range or variation statistics (e.g., quartiles, standard errors, or standard deviations) are computed for each variable and for each group of cases and the selected values are presented as double ribbons.
3D Range plots (see example graph below) differ from 3D Box plots in that for Range plots, the ranges are the values of the selected variables (e.g., one variable contains the minimum range values and another variable contains the maximum range values) while for Box plots the ranges are calculated from variable values (e.g., standard deviations, standard errors, or min-max value).
Box Plots 3D - Flying Blocks. In this style of 3D Sequential Box Plot, the ranges of values of selected variables are plotted separately for groups of cases defined by values of a categorical (grouping) variable. The central tendency (e.g., median or mean), and range or variation statistics (e.g., quartiles, standard errors, or standard deviations) are computed for each variable and for each group of cases and the selected values are presented as "flying" blocks.
3D Range plots (see example graph below) differ from 3D Box plots in that for Range plots, the ranges are the values of the selected variables (e.g., one variable contains the minimum range values and another variable contains the maximum range values) while for Box plots the ranges are calculated from variable values (e.g., standard deviations, standard errors, or min-max value).
Box Plots 3D - Flying Boxes. In this style of 3D Sequential Box Plot, the ranges of values of selected variables are plotted separately for groups of cases defined by values of a categorical (grouping) variable. The central tendency (e.g., median or mean), and range or variation statistics (e.g., quartiles, standard errors, or standard deviations) are computed for each variable and for each group of cases and the selected values are presented as "flying" boxes.
3D Range plots (see example graph below) differ from 3D Box plots in that for Range plots, the ranges are the values of the selected variables (e.g., one variable contains the minimum range values and another variable contains the maximum range values) while for Box plots the ranges are calculated from variable values (e.g., standard deviations, standard errors, or min-max value).
Box Plots 3D - Points. In this style of 3D Sequential Box Plot, the ranges of values of selected variables are plotted separately for groups of cases defined by values of a categorical (grouping) variable. The central tendency (e.g., median or mean), and range or variation statistics (e.g., quartiles, standard errors, or standard deviations) are computed for each variable and for each group of cases and the selected values are presented as point markers connected by a line.
3D Range plots (see example graph below) differ from 3D Box plots in that for Range plots, the ranges are the values of the selected variables (e.g., one variable contains the minimum range values and another variable contains the maximum range values) while for Box plots, the ranges are calculated from variable values (e.g., standard deviations, standard errors, or min-max value).
Box-Ljung Q Statistic. In Time Series analysis, you can shift a series by a given lag k. For that given lag, the Box-Ljung Q statistic is defined by:
Q_{k} = n*(n+2)*Sum(r_{i}^{2}/(n-i))
for i = 1 to k
When the number of observations is large, the Q statistic has a Chi-square distribution with k-p-q degrees of freedom, where p and q are the number of autoregressive and moving average parameters, respectively.
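The statistic is straightforward to compute directly from the sample autocorrelations r_i, as in the following sketch:

```python
# Box-Ljung Q statistic: Q_k = n*(n+2) * sum over i=1..k of r_i^2 / (n-i),
# where r_i is the lag-i sample autocorrelation of the series.
def ljung_box_q(series, k):
    n = len(series)
    m = sum(series) / n
    dev = [x - m for x in series]
    c0 = sum(d * d for d in dev)                  # n times the lag-0 autocovariance
    q = 0.0
    for i in range(1, k + 1):
        r_i = sum(dev[t] * dev[t - i] for t in range(i, n)) / c0
        q += r_i ** 2 / (n - i)
    return n * (n + 2) * q

q = ljung_box_q([1.0, 2.0, 3.0, 4.0], k=1)        # a toy series
print(q)
```

The resulting value of Q would then be compared against a Chi-square quantile with k-p-q degrees of freedom to obtain the significance test.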
Breakdowns. Breakdowns are procedures that allow us to calculate descriptive statistics and correlations for dependent variables in each of a number of groups defined by one or more grouping (independent) variables. They can be used as either a hypothesis-testing or an exploratory method.
For more information, see the Breakdowns section of Basic Statistics.
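A breakdown amounts to computing descriptive statistics of a dependent variable separately within each group. A minimal sketch (the group labels and values below are hypothetical):

```python
# Breakdown: descriptive statistics of a dependent variable per group.
from statistics import mean, stdev

records = [
    ("A", 10.0), ("A", 12.0), ("A", 14.0),
    ("B", 20.0), ("B", 22.0), ("B", 24.0),
]

# Collect the dependent-variable values for each level of the grouping variable.
groups = {}
for group, outcome in records:
    groups.setdefault(group, []).append(outcome)

# Count, mean, and standard deviation within each group.
breakdown = {g: {"n": len(v), "mean": mean(v), "sd": stdev(v)}
             for g, v in groups.items()}
print(breakdown)
```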
Brushing. Perhaps the most common and historically first widely used technique explicitly identified as graphical exploratory data analysis is brushing, an interactive method allowing us to select on-screen specific data points or subsets of data and identify their (e.g., common) characteristics, or to examine their effects on relations between relevant variables (e.g., in scatterplot matrices) or to identify (e.g., label) outliers. For more information on brushing, see Special Topics in Graphical Analytic Techniques: Brushing.
Burt Table. Multiple correspondence analysis expects as input (i.e., the program will compute prior to the analysis) a so-called Burt table. The Burt table is the result of the inner product of a design or indicator matrix. If you denote the data (design or indicator matrix) as matrix X, then the matrix product X'X is a Burt table; shown below is an example of a Burt table that one might obtain in this manner.
| | SURVIVAL: NO | SURVIVAL: YES | AGE: <50 | AGE: 50-69 | AGE: 69+ | LOCATION: TOKYO | LOCATION: BOSTON | LOCATION: GLAMORGN |
|---|---|---|---|---|---|---|---|---|
| SURVIVAL: NO | 210 | 0 | 68 | 93 | 49 | 60 | 82 | 68 |
| SURVIVAL: YES | 0 | 554 | 212 | 258 | 84 | 230 | 171 | 153 |
| AGE: <50 | 68 | 212 | 280 | 0 | 0 | 151 | 58 | 71 |
| AGE: 50-69 | 93 | 258 | 0 | 351 | 0 | 120 | 122 | 109 |
| AGE: 69+ | 49 | 84 | 0 | 0 | 133 | 19 | 73 | 41 |
| LOCATION: TOKYO | 60 | 230 | 151 | 120 | 19 | 290 | 0 | 0 |
| LOCATION: BOSTON | 82 | 171 | 58 | 122 | 73 | 0 | 253 | 0 |
| LOCATION: GLAMORGN | 68 | 153 | 71 | 109 | 41 | 0 | 0 | 221 |
Overall, the data matrix is symmetrical. In the case of 3 categorical variables (as shown above), the data matrix consists of 3 x 3 = 9 partitions, created by each variable being tabulated against itself, and against the categories of all other variables. Note that the sum of the diagonal elements in each diagonal partition (i.e., where the respective variables are tabulated against themselves) is constant (equal to 764 in this case). The off-diagonal elements in each partition in this example are all 0. If the cases in the design or indicator matrix are assigned to categories via fuzzy coding, then the off-diagonal elements of the diagonal partitions are not necessarily equal to 0.
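The construction X'X can be sketched with a tiny hypothetical indicator matrix (2 variables with 2 categories each, 4 cases) rather than the full data set above:

```python
# Build a Burt table as X'X from an indicator (design) matrix X,
# where each column is a 0/1 dummy for one category.
# Columns: SURVIVAL:NO, SURVIVAL:YES, AGE:<50, AGE:50+ (hypothetical data).
X = [
    [1, 0, 1, 0],
    [0, 1, 1, 0],
    [0, 1, 0, 1],
    [1, 0, 0, 1],
]

# Burt table = X'X, the inner product of the indicator matrix with itself.
n_cols = len(X[0])
burt = [[sum(row[i] * row[j] for row in X) for j in range(n_cols)]
        for i in range(n_cols)]

# With crisp (0/1) coding, the diagonal partitions carry the category
# counts on the diagonal and zeros off the diagonal, and the table is
# symmetric, as described above.
for row in burt:
    print(row)
```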