is approximated by the chi‐square distribution. Note that since the log likelihood for the saturated model is common to both deviance values, ΔD is calculated without actually estimating the saturated model. This fact becomes very important during subset selection. The formula for ΔD that is used for testing the significance of the regression coefficient(s) associated with the independent variable X1 is ΔDX1 = Dwithout X1 −Dwith X1 = −2 [Lwithout X1 − LSaturated] + 2[Lwith X1 − LSaturated] = −2[LwithoutX1 − LwithX1].
Note that this formula looks identical to the likelihood ratio statistic. Because of the similarity between the change in deviance test and the likelihood ratio test, their names are often used interchangeably.
The formula for the Wald statistic is
2.2.2 Decision Tree Classifiers
As indicated in Section 2.1, decision trees are considered to be one of the most popular approaches for representing classifiers. Here, we present a number of methods for constructing decision tree classifiers in a top‐down manner. This section suggests a unified algorithmic framework for presenting these algorithms and describes the various splitting criteria and pruning methodologies. The material is discussed along the lines presented in [8–11].
Supervised methods are methods that attempt to discover relationship between the input attributes and the target attribute. The relationship discovered is represented in a structure referred to as a model. Usually, models can be used for predicting the value of the target attribute knowing the values of the input attributes. It is useful to distinguish between two main supervised models: classification models (classifiers) and regression models. Regression models map the input space into a real‐valued domain, whereas classifiers map the input space into predefined classes. For instance, classifiers can be used to classify students into two groups: those passing exams on time and those passing exams with a delay. Many approaches are used to represent classifiers. The decision tree is probably the most widely used approach for this purpose.
Terminology: In a typical supervised learning, a training set of labeled examples is given, and the goal is to form a description that can be used to predict previously unseen examples. Most frequently, the training sets are described as a bag instance of a certain bag schema. The bag schema provides the description of the attributes and their domains. Formally, bag schema is denoted as R(A ∪ y), where A denotes the set of n attributes A = {a1, … ai, … an}, and y represents the class variable or the target attribute.
Attributes can be nominal or numeric. When the attribute ai is nominal, it is useful to denote by
Usually, it is assumed that the training set tuples are generated randomly and independently according to some fixed and unknown joint probability distribution D over U. Note that this is a generalization of the deterministic case when a supervisor classifies a tuple using a function y = f(x). Here‚ we use the common notation of bag algebra to present the projection (π) and selection (σ) of tuples.
The ML community, which is target audience of this book, has introduced the problem of concept learning. To learn a concept is to infer its general definition from a set of examples. This definition may be either explicitly formulated or left implicit, but either way it assigns each possible example to the concept or not. Thus, a concept can be formally regarded as a function from the set of all possible examples to the Boolean set {true, false}.
On the other hand, the data mining community prefers to deal with a straightforward extension of the concept learning, known as the classification problem. In this case, we search for a function that maps the set of all possible examples into a predefined set of class labels and is not limited to the Boolean set.
An inducer is an entity that obtains a training set and forms a classifier that represents the generalized relationship between the input attributes and the target attribute.
The notation I represents an inducer and I(S) represents a classifier that was induced by performing I on a training set S. Most frequently, the goal of the classifier’s inducers is formally defined as follows: Given a training set S with input attributes set A = {a1, a2, … an} and a target attribute y from a unknown fixed distribution D over the labeled instance space, the goal is to induce an optimal classifier with minimum generalization error.
Generalization error is defined as the misclassification rate over the distribution D. In case of the nominal attributes, it can be expressed as
(2.11)
where L(y, I(S)(x) is the loss function defined as L(y, I(S)(x)) = 0, if y = I(S)(x) and 1, if y ≠ I(S)(x). In the case of numeric attributes, the sum operator is replaced with the appropriate integral operator.