It is often said that the performance of a classifier is data dependent [23]. Nevertheless, work that proposes new classifiers usually neglects data dependency when analyzing their performance. The strategy usually employed is to provide a few cases in which the proposed classifier outperforms baseline classifiers according to some performance measure. Similarly, theoretical studies that analyze the behavior of classifiers also tend to neglect data dependency: they evaluate the performance of a classifier across a wide range of problems, which results in weak performance bounds.
Recent efforts have tried to link data characteristics to the performance of different classifiers in order to build recommendation systems [23]. Meta-learning is an attempt to understand data prior to executing a learning algorithm. Data that describe the characteristics of datasets and learning algorithms are called meta-data. A learning algorithm is employed to interpret these meta-data and suggest a particular learner (or rank a few learners) so as to better solve the problem at hand.
Meta-learners for algorithm selection usually rely on data measures limited to statistical or information-theoretic descriptions. Whereas these descriptions can be sufficient for recommending algorithms, they do not explain the geometrical characteristics of the class distributions, i.e., the manner in which classes are separated or interleaved, a critical factor for determining classification accuracy. Hence, geometrical measures are proposed in [20, 21] for characterizing the geometrical complexity of classification problems. The study of these measures is a first effort to better understand classifiers’ data-dependency. Moreover, by establishing the difficulty of a classification problem quantitatively, several studies in classification can be carried out, such as algorithm recommendation, guided data pre-processing and design of problem-aware classification algorithms.
Next, we present a summary of the meta-data and geometrical measures we use to assess classification difficulty. For the latter, the reader should notice that they actually measure the apparent geometrical complexity of datasets, since the amount of training data is limited and the true probability distribution of each class is unknown.
Meta-data
The first attempt to characterize datasets for evaluating the performance of learning algorithms was made by Rendell et al. [24]. Their approach aimed to predict the execution time of classification algorithms using very simple meta-attributes, such as the number of attributes and the number of examples.
A significant improvement over this approach was project STATLOG [25], which investigated the performance of several learning algorithms over more than twenty datasets. Approaches that followed deepened the analysis of the same set of meta-attributes for data characterization [26, 27]. This set of meta-attributes was divided into three categories: (i) simple; (ii) statistical; and (iii) information theory-based. An improved set of meta-attributes is further discussed in [28], and we make use of the following measures presented therein:
1. Number of examples (N)
2. Number of attributes (n)
3. Number of continuous attributes (con)
4. Number of nominal attributes (nom)
5. Number of binary attributes (bin)
6. Number of classes (cl)
7. Percentage of missing values (%mv)
8. Class entropy (H(Y))
9. Mean attribute entropy (MAE)
10. Mean attribute Gini (MAG)
11. Mean mutual information of class and attributes (MMI)
12. Uncertainty coefficient (UC)
We have chosen these measures because they are widely used and have presented interesting results in the meta-learning literature. At the same time, they are computationally efficient and simple to implement.
Measures 1–7 can be extracted in a straightforward way from the data. Measures 8–12 are information-theory based. The class entropy, H(Y), is calculated as follows:
$$ H(Y) = -\sum_{j=1}^{\mathit{cl}}{p(Y=y_{j})\log_{2}{p(Y=y_{j})}} $$
(4)
where \(p(Y=y_{j})\) is the probability that the class attribute Y takes the value \(y_{j}\). Entropy is a measure of the randomness or dispersion of a given discrete attribute. Thus, the class entropy indicates the dispersion of the class attribute: the more uniform the distribution of the class attribute, the higher the entropy; the less uniform, the lower the entropy.
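As an illustration, a minimal sketch of this computation in Python is shown below; the function name class_entropy and the use of NumPy are our own choices and not part of the original formulation.

```python
import numpy as np

def class_entropy(y):
    """Entropy H(Y), in bits, of a vector of class labels (Eq. (4))."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

# A perfectly balanced binary problem has the maximal entropy of 1 bit.
print(class_entropy(["a", "a", "b", "b"]))  # 1.0
```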
Mean attribute entropy is the average entropy of all discrete (nominal) attributes. It is given by
$$ \mathit{MAE} = \frac{\sum_{i=1}^{\mathit{nom}}{H(X_{i})}}{\mathit{nom}} $$
(5)
Similarly, the mean attribute Gini is the average Gini index of all nominal attributes. The Gini index of a discrete attribute X is given by
$$ \mathit{Gini}(X) = 1 - \sum_{i}{p(X=x_{i})^{2}} $$
(6)
and hence
$$ \mathit{MAG} = \frac{\sum_{i=1}^{\mathit{nom}}{\mathit{Gini}(X_{i})}}{\mathit{nom}} $$
(7)
The mean mutual information of class and attributes measures the average of the information each attribute X conveys about the class attribute Y. The mutual information MI(Y,X) (also known as information gain in the machine learning community) describes the reduction in the uncertainty of Y due to the knowledge of X, and it is defined as
$$ \mathit{MI}(Y,X) = H(Y) - H(Y|X) $$
(8)
with
$$ H(Y|X) = \sum_{i}{p(X=x_{i})H(Y|X=x_{i})} $$
(9)
for an attribute X with i categories and a class attribute Y with j categories. Note that \(H(Y|X=x_{i})\) is the entropy of the class attribute Y considering only those examples in which \(X=x_{i}\). With MI(Y,X) thus defined, the mean mutual information of class and attributes is given by
$$ \mathit{MMI} = \frac{\sum_{i=1}^{\mathit{nom}}{\mathit{MI}(Y,X_{i})}}{\mathit{nom}} $$
(10)
The last measure to be defined is the uncertainty coefficient, which is the mutual information normalized by the entropy of the class attribute, MI(Y,X)/H(Y). It is analogous to the well-known gain ratio, except that the gain ratio normalizes the mutual information by the entropy of the predictive attribute rather than by the entropy of the class attribute.
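The information-theoretic meta-attributes above can be sketched in a few lines of Python. The helper names are ours, the nominal attributes are assumed to be given as a list of label columns, and we take the uncertainty coefficient as MMI/H(Y), i.e., averaged over the attributes, which is one possible reading of the definition above.

```python
import numpy as np

def entropy(values):
    """Entropy of a discrete attribute, in bits."""
    _, counts = np.unique(values, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def gini(values):
    """Gini index of a discrete attribute."""
    _, counts = np.unique(values, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - (p ** 2).sum())

def mutual_information(y, x):
    """MI(Y, X) = H(Y) - sum_i p(X = x_i) * H(Y | X = x_i)."""
    y, x = np.asarray(y), np.asarray(x)
    h_y_given_x = 0.0
    for value in np.unique(x):
        mask = x == value
        h_y_given_x += mask.mean() * entropy(y[mask])
    return entropy(y) - h_y_given_x

def info_theoretic_meta_attributes(y, nominal_columns):
    """MAE, MAG, MMI and UC for a list of nominal attribute columns."""
    mae = np.mean([entropy(col) for col in nominal_columns])
    mag = np.mean([gini(col) for col in nominal_columns])
    mmi = np.mean([mutual_information(y, col) for col in nominal_columns])
    uc = mmi / entropy(y)   # uncertainty coefficient, taken here as MMI / H(Y)
    return mae, mag, mmi, uc
```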
In addition to the measures presented in [28], we also make use of the ratio of the number of examples in the least frequent class to the number of examples in the most frequent class. For binary classification problems, this measure indicates the class balance of the dataset: values closer to one indicate a balanced dataset, whereas lower values indicate an imbalanced-class problem.
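A possible implementation of this balance ratio, again with names of our own choosing, is:

```python
import numpy as np

def class_balance_ratio(y):
    """Ratio of the size of the least frequent class to the most frequent one."""
    _, counts = np.unique(y, return_counts=True)
    return float(counts.min() / counts.max())

print(class_balance_ratio([0, 0, 0, 1]))  # 0.33..., an imbalanced dataset
```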
Next we present measures that seek to explain how the data is structured geometrically in order to assess the difficulty of a classification problem.
Geometrical complexity measures
In [20] a set of measures is presented to characterize datasets with regard to their geometrical structure. These measures can highlight the manner in which classes are separated or interleaved, which is a critical factor for classification accuracy. Indeed, the geometry of classes is crucial for determining the difficulty of classifying a dataset [21].
These measures are divided into three categories: (i) measures of overlaps in the attribute space; (ii) measures of class separability; and (iii) measures of geometry, topology, and density of manifolds.
Measures of overlaps in the attribute space
The following measures estimate different complexities related to the discriminative power of the attributes.
The maximum Fisher discriminant ratio (F1)
This measure computes the 2-class Fisher criterion, given by
$$ f = \frac{(\mathbf{d}^{T}\varDelta)^{2}}{\mathbf{d}^{T}\bar{\varSigma}\mathbf{d}} $$
(11)
where \(\mathbf{d} = \bar{\varSigma}^{-1}\varDelta\) is the directional vector on which the data are projected, \(\varDelta = \mu_{1} - \mu_{2}\), \(\bar{\varSigma}^{-1}\) is the pseudo-inverse of \(\bar{\varSigma}\), \(\mu_{i}\) is the mean vector of class \(c_{i}\), \(\bar{\varSigma} = a\varSigma_{1} + (1-a)\varSigma_{2}\) with \(0 \leq a \leq 1\), and \(\varSigma_{i}\) is the scatter matrix of the instances of class \(c_{i}\).
A high value of the Fisher discriminant ratio indicates that there exists a vector onto which the instances can be projected such that examples belonging to different classes become separated.
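Under the assumptions that the two classes are given as NumPy arrays of shape (n_i, d) and that the per-class covariance matrices are acceptable stand-ins for the scatter matrices, a sketch of F1 could look as follows (function and parameter names are ours):

```python
import numpy as np

def fisher_discriminant_ratio_f1(X1, X2, a=0.5):
    """Sketch of F1 following Eq. (11): project onto d = pinv(Sigma_bar) @ Delta."""
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    delta = mu1 - mu2
    # Weighted combination of the per-class covariance matrices
    # (used here as a proxy for the scatter matrices).
    sigma = np.atleast_2d(a * np.cov(X1, rowvar=False)
                          + (1 - a) * np.cov(X2, rowvar=False))
    d = np.linalg.pinv(sigma) @ delta          # directional vector
    return float((d @ delta) ** 2 / (d @ sigma @ d))
```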
The overlap of the per-class bounding boxes (F2)
This measure computes the overlap of the tails of distributions defined by the instances of each class. For each attribute, it computes the ratio of the width of the overlap interval (i.e., the interval that has instances of both classes) to the width of the entire interval. Then, the measure returns the product of the ratios calculated for each attribute:
$$ F2 = \prod_{i=1}^{n}\frac{\mathit{MINMAX}_{i} - \mathit{MAXMIN}_{i}}{\mathit{MAXMAX}_{i} - \mathit{MINMIN}_{i}} $$
(12)
where \(\mathit{MINMAX}_{i} = \min(\max(f_{i},c_{1}), \max(f_{i},c_{2}))\), \(\mathit{MAXMIN}_{i} = \max(\min(f_{i},c_{1}), \min(f_{i},c_{2}))\), \(\mathit{MAXMAX}_{i} = \max(\max(f_{i},c_{1}), \max(f_{i},c_{2}))\), \(\mathit{MINMIN}_{i} = \min(\min(f_{i},c_{1}), \min(f_{i},c_{2}))\), n is the total number of attributes, \(f_{i}\) is the ith attribute, \(c_{1}\) and \(c_{2}\) refer to the two classes, and \(\max(f_{i},c_{j})\) and \(\min(f_{i},c_{j})\) are, respectively, the maximum and minimum values of attribute \(f_{i}\) for class \(c_{j}\). Nominal values are mapped to integer values to compute this measure. A low value of this measure means that the attributes can discriminate between the instances of the different classes.
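A direct sketch of Eq. (12) in Python is given below; clipping negative overlaps at zero is our choice (the original formula leaves this implicit), and attributes that are constant over both classes would need special handling.

```python
import numpy as np

def volume_overlap_f2(X1, X2):
    """F2: product over attributes of the relative overlap of the
    per-class value ranges, as in Eq. (12)."""
    minmax = np.minimum(X1.max(axis=0), X2.max(axis=0))
    maxmin = np.maximum(X1.min(axis=0), X2.min(axis=0))
    maxmax = np.maximum(X1.max(axis=0), X2.max(axis=0))
    minmin = np.minimum(X1.min(axis=0), X2.min(axis=0))
    # Clip at zero so that a non-overlapping attribute zeroes the product.
    overlap = np.clip(minmax - maxmin, 0, None) / (maxmax - minmin)
    return float(np.prod(overlap))
```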
The maximum (individual) attribute efficiency (F3)
This measure computes the discriminative power of individual attributes and returns the value of the attribute that can discriminate the largest number of training instances. For this purpose, the following heuristic is employed. For each attribute, we consider the overlapping region (i.e., the region where there are instances of both classes) and return the ratio of the number of instances that are not in this overlapping region to the total number of instances. Then, the maximum discriminative ratio is taken as measure F3. Note that a problem is easy if there exists one attribute for which the ranges of the values spanned by each class do not overlap (in this case, this would be a linearly separable problem).
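A sketch of this heuristic for a two-class problem, assuming the classes are provided as NumPy arrays and using helper names of our own, might read:

```python
import numpy as np

def attribute_efficiency(x1, x2):
    """Fraction of values of one attribute lying outside the region
    where the value ranges of the two classes overlap."""
    lo = max(x1.min(), x2.min())           # start of the overlap region
    hi = min(x1.max(), x2.max())           # end of the overlap region
    values = np.concatenate([x1, x2])
    return float(((values < lo) | (values > hi)).mean())

def max_attribute_efficiency_f3(X1, X2):
    """F3: the best single-attribute discriminative ratio."""
    return max(attribute_efficiency(X1[:, j], X2[:, j])
               for j in range(X1.shape[1]))
```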
The collective attribute efficiency (F4)
This measure follows the same idea presented by F3, but it now considers the discriminative power of all the attributes (hence, collective attribute efficiency). To compute the collective discriminative power, we apply the following procedure. First, we select the most discriminative attribute, that is, the attribute that discriminates the largest number of instances. Then, all the instances that can be discriminated are removed from the dataset, and the next most discriminative attribute (regarding the remaining examples) is selected. This procedure is repeated until all the examples are discriminated or all the attributes have been analyzed. Finally, the measure returns the proportion of instances that have been discriminated. Thus, it gives us an idea of the fraction of instances whose class could be correctly predicted by building separating hyperplanes that are parallel to one of the axes of the attribute space. Note that this measure differs slightly from the maximum attribute efficiency: F3 considers only the number of examples discriminated by the single most discriminative attribute, whereas F4 takes all the attributes into account and thus highlights their collective discriminative power.
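The iterative procedure can be sketched as follows for a two-class dataset given as a numeric matrix X and a label vector y; the stopping rules and the handling of the degenerate case in which only one class remains are our interpretation of the description above.

```python
import numpy as np

def collective_attribute_efficiency_f4(X, y):
    """F4 (sketch): repeatedly pick the attribute that discriminates the most
    of the remaining instances, remove them, and return the fraction removed."""
    X = np.asarray(X, dtype=float)        # nominal values assumed mapped to integers
    y = np.asarray(y)
    classes = np.unique(y)
    remaining = np.ones(len(y), dtype=bool)
    unused = set(range(X.shape[1]))
    discriminated = 0
    while remaining.any() and unused:
        best_attr, best_mask = None, None
        for j in unused:
            x, labels = X[remaining, j], y[remaining]
            x1, x2 = x[labels == classes[0]], x[labels == classes[1]]
            if len(x1) == 0 or len(x2) == 0:   # only one class left: trivially separated
                mask = np.ones(len(x), dtype=bool)
            else:
                lo = max(x1.min(), x2.min())
                hi = min(x1.max(), x2.max())
                mask = (x < lo) | (x > hi)     # instances outside the overlap region
            if best_mask is None or mask.sum() > best_mask.sum():
                best_attr, best_mask = j, mask
        if best_mask.sum() == 0:               # no attribute separates anything more
            break
        discriminated += int(best_mask.sum())
        idx = np.flatnonzero(remaining)
        remaining[idx[best_mask]] = False
        unused.discard(best_attr)
    return discriminated / len(y)
```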
Measures of class separability
In this section, we describe five measures that examine the shape of the class boundary to estimate the complexity of separating instances of different classes.
The fraction of points on the class boundary (N1)
This measure provides an estimate of the length of the class boundary. For this purpose, it builds a minimum spanning tree over the entire dataset and returns the ratio of the number of nodes of the spanning tree that are connected to nodes of a different class to the total number of examples in the dataset. If a node \(n_{i}\) is connected to nodes of several different classes, \(n_{i}\) is counted only once. High values of this measure indicate that the majority of the points lie close to the class boundary and, therefore, that it may be more difficult for the learner to define this boundary accurately.
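A sketch using SciPy's minimum spanning tree is shown below; note that duplicated points (zero distances) are ignored by this particular implementation, and the tie-breaking of the original tool may differ.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

def boundary_fraction_n1(X, y):
    """N1: fraction of instances that share an MST edge with an instance
    of another class (each instance counted at most once)."""
    y = np.asarray(y)
    dist = squareform(pdist(X))                    # dense Euclidean distances
    mst = minimum_spanning_tree(dist).toarray()    # nonzero entries are MST edges
    rows, cols = np.nonzero(mst)
    boundary = set()
    for i, j in zip(rows, cols):
        if y[i] != y[j]:                           # edge crosses the class boundary
            boundary.update((i, j))
    return len(boundary) / len(y)
```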
The ratio of average intra/inter-class nearest neighbor distance (N2)
This measure compares the within-class spread with the distances to the nearest neighbors of the other classes. That is, for each input instance \(x_{i}\), we calculate the distance to its nearest neighbor within the same class (\(\mathit{intraDist}(x_{i})\)) and the distance to its nearest neighbor of any other class (\(\mathit{interDist}(x_{i})\)). The result is the ratio of the sum of the intra-class distances to the sum of the inter-class distances over all input examples:
$$ N2 = \frac{\sum_{i=1}^{N}\mathit{intraDist}(x_{i})}{\sum_{i=1}^{N}\mathit{interDist}(x_{i})} $$
(13)
where N is the total number of instances in the dataset.
Low values of this measure suggest that the examples of the same class lie close together in the attribute space, whereas high values indicate that the examples of the same class are dispersed.
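Equation (13) translates almost directly into code; the sketch below assumes numeric attributes and Euclidean distances.

```python
import numpy as np
from scipy.spatial.distance import cdist

def intra_inter_ratio_n2(X, y):
    """N2: sum of nearest-neighbour distances within the own class divided by
    the sum of nearest-neighbour distances to any other class (Eq. (13))."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    dist = cdist(X, X)
    np.fill_diagonal(dist, np.inf)                 # ignore self-distances
    same = y[:, None] == y[None, :]
    intra = np.where(same, dist, np.inf).min(axis=1)
    inter = np.where(~same, dist, np.inf).min(axis=1)
    return float(intra.sum() / inter.sum())
```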
The leave-one-out error rate of the one-nearest neighbor classifier (N3)
The measure denotes how close the examples of different classes are. It returns the leave-one-out error rate of the one-nearest neighbor (the kNN classifier with k=1) learner. Low values of this metric indicate that there is a large gap in the class boundary.
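With scikit-learn, N3 reduces to a leave-one-out cross-validation of a 1NN classifier; the snippet below is one possible way to compute it.

```python
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def loo_1nn_error_n3(X, y):
    """N3: leave-one-out error rate of a one-nearest-neighbour classifier."""
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=1),
                             X, y, cv=LeaveOneOut())
    return 1.0 - scores.mean()
```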
The minimized sum of the error distance of a linear classifier (L1)
This measure evaluates to what extent the training data are linearly separable. For this purpose, it returns the sum of the differences between the predictions of a linear classifier and the actual class values. We use a support vector machine (SVM) [29] with a linear kernel, trained with the sequential minimal optimization (SMO) algorithm, to build the linear classifier. The SMO algorithm provides an efficient training method, and the result is a linear classifier that separates the instances of two classes by means of a hyperplane. A value of zero indicates that the problem is linearly separable.
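A sketch of this measure with scikit-learn's linear SVM is given below; the regularization constant C and the normalization by the norm of the weight vector are our choices, so the scale of the returned value may differ from that of the original definition. A two-class problem is assumed.

```python
import numpy as np
from sklearn.svm import SVC

def linear_error_distance_l1(X, y):
    """L1 (sketch): train a linear SVM and sum the distances to the separating
    hyperplane of the instances it misclassifies."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    clf = SVC(kernel="linear", C=1.0).fit(X, y)
    margins = clf.decision_function(X)             # signed distance scaled by ||w||
    errors = clf.predict(X) != y
    w_norm = np.linalg.norm(clf.coef_)
    return float(np.abs(margins[errors]).sum() / w_norm)
```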
The training error of a linear classifier (L2)
This measure also indicates the extent to which the training data are linearly separable. It builds the linear classifier described above and returns its training error.
Measures of geometry, topology, and density of manifolds
The following four metrics indirectly characterize the class separability by assuming that a class is made up of single and multiple manifolds that form the support of the distribution of the class.
The nonlinearity of a linear classifier (L3)
This metric implements a measure of nonlinearity proposed in [30]. Given the training dataset, the method creates a test set by linear interpolation with random coefficients between pairs of randomly selected instances of the same class. Then, the measure returns the test error rate of the linear classifier (the support vector machine with linear kernel) trained with the original training set. The metric is sensitive to the smoothness of the classifier boundary and the overlap on the convex hull of the classes. This metric is implemented only for 2-class datasets.
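The interpolation scheme can be sketched as follows; the number of synthetic test points (n_test) and the random seed are arbitrary choices of ours, and a two-class dataset is assumed.

```python
import numpy as np
from sklearn.svm import SVC

def nonlinearity_of_linear_classifier_l3(X, y, n_test=1000, seed=0):
    """L3 (sketch): build synthetic test points by interpolating between random
    pairs of same-class training instances, then measure the error of a linear
    SVM trained on the original data."""
    rng = np.random.default_rng(seed)
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    synth_X, synth_y = [], []
    for _ in range(n_test):
        label = rng.choice(np.unique(y))
        idx = np.flatnonzero(y == label)
        i, j = rng.choice(idx, size=2, replace=True)
        t = rng.random()                           # random interpolation coefficient
        synth_X.append(t * X[i] + (1 - t) * X[j])
        synth_y.append(label)
    clf = SVC(kernel="linear").fit(X, y)
    return 1.0 - clf.score(np.array(synth_X), np.array(synth_y))
```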
The nonlinearity of the one-nearest neighbor classifier (N4)
This measure creates a test set as proposed by L3 and returns the test error of the 1NN classifier.
The fraction of maximum covering spheres (T1)
This measure originated in the work of Lebourgeois and Emptoz [31], which describes the shapes of class manifolds using the notion of adherence subset. In short, an adherence subset is a sphere centered on an example of the dataset that is grown as much as possible before touching any example of another class. Therefore, an adherence subset contains a set of examples of the same class and cannot grow further without including examples of other classes. The measure considers only the largest adherence subsets (spheres), removing those that are included in others. It then returns the number of remaining spheres normalized by the total number of points.
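An approximation of this procedure is sketched below: each sphere's radius is set to the distance to the nearest example of another class, and a sphere is discarded when it fits entirely inside a larger one. This containment test is our simplification of the adherence-subset construction, so the values may deviate from those of the original implementation.

```python
import numpy as np
from scipy.spatial.distance import cdist

def fraction_of_covering_spheres_t1(X, y):
    """T1 (approximation): count maximal same-class spheres per instance."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    dist = cdist(X, X)
    enemy = y[:, None] != y[None, :]
    radius = np.where(enemy, dist, np.inf).min(axis=1)   # distance to nearest enemy
    keep = np.ones(len(X), dtype=bool)
    for a in np.argsort(-radius):                        # largest spheres first
        if not keep[a]:
            continue
        # Sphere b is absorbed if it lies entirely inside sphere a.
        contained = dist[a] + radius <= radius[a]
        contained[a] = False
        keep[contained] = False
    return keep.sum() / len(X)
```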
The average number of points per dimension (T2)
This measure returns the ratio of the number of examples in the dataset to the number of attributes. It is a rough indicator of sparseness of the dataset.
Results of the data-dependency analysis
We have calculated the 13 complexity measures and the 13 meta-attributes for the 129 datasets, and we have built a training set in which each example corresponds to a dataset and each attribute is one of the 26 measures. In addition, we have included which method was better for each dataset regarding test accuracy (ClusEM, Clusk or J48) as our class attribute. Thus, we have a training set of 129 examples and 27 attributes.
Our intention with this training set is to perform a descriptive analysis in order to understand which aspects of the data have a greater influence on the performance of the algorithms. In particular, we search for evidence that may help the user to choose a priori the most suitable method for classification. First, we have built a decision tree for each pairwise comparison (Clusk×J48 and ClusEM×J48) over the previously described training set. The idea is that the rules extracted from these decision trees may offer insight into the reasons why one algorithm outperforms the other. In other words, we are using a decision tree as a descriptive tool instead of a predictive tool. This is not unusual, since the classification model provided by the decision tree can serve as an explanatory tool to distinguish between objects of different classes [32]. Figures 1 and 2 show these descriptive decision trees, which explain the behavior of roughly 90 % of the data.
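Such a descriptive tree can be obtained, for instance, with scikit-learn; the depth limit and the variable names below (meta_X, winner) are illustrative and are not tied to our actual experimental setup.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

def describe_pairwise_winner(meta_X, winner, feature_names, max_depth=4):
    """Fit a shallow decision tree on the meta-dataset and return its rules
    as text, using the tree descriptively rather than predictively."""
    tree = DecisionTreeClassifier(max_depth=max_depth, random_state=0)
    tree.fit(meta_X, winner)          # winner holds, e.g., "Clusk" or "J48" per dataset
    return export_text(tree, feature_names=feature_names)
```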
First, we analyze the scenarios in which one may choose to employ either ClusEM or J48. By analyzing the decision tree in Fig. 1, we notice that J48 is recommended for problems in which the N4 measure is above a given threshold. N4 provides an idea of data linearity: the higher its value, the more complex the decision boundary between classes. The decision tree indicates that above a given threshold of N4 (\(\mathtt{N4} > \mathtt{0.182}\)), clustering provides no advantage for classifying objects. For the remaining cases, the decision tree shows that problems deemed simpler by the attribute overlapping measures (F1 and F4) are better handled by J48. Specifically regarding F4, notice that the descriptive decision tree recommends employing J48 for problems in which practically 100 % of the instances can be correctly predicted by building separating hyperplanes that are parallel to one of the attribute axes (\(\mathtt{F4} > \mathtt{0.998}\)). Thus, for simpler problems in which axis-parallel hyperplanes can separate the classes, J48 is an effective option. This conclusion is intuitive considering that J48 has no particular difficulty in generating axis-parallel hyperplanes to correctly separate the training instances of easier problems. However, in more complex problems in which some instances cannot be separated by axis-parallel hyperplanes, a clustering procedure that partitions the input space into sub-spaces may be more effective in solving the problem.
The decision tree in Fig. 1 also recommends employing J48 for problems whose sparsity (measured by T2) is below a given threshold (the higher the value of T2, the denser the dataset). This implies that for sparser problems, in which J48 alone has problems in generating appropriate separating hyperplanes, clustering the training set may be a more effective option for classification. Though the dataset sparsity is not affected by clustering, one can assume that the generated sub-spaces are simpler to classify than the full sparse input space.
Next, we analyze the decision tree that recommends employing either Clusk or J48. The decision tree in Fig. 2 shows that for datasets whose number of nominal attributes surpasses a given threshold (nom>7), J48 is the recommended algorithm. This can be intuitively explained by the difficulty that most clustering algorithms have in dealing with nominal attributes. For the remaining cases, the sparsity measure T2 is tested to decide between Clusk and J48. Notice that, similarly to the previous decision tree, clustering the dataset is recommended for sparser datasets, alleviating the difficulty of finding axis-parallel hyperplanes for class separation in sparse problems.
Our last recommendation is for users for whom interpretability is a strong need: if none of the scenarios highlighted above suggests that Clus-DTI is a better option than J48, it may still be worthwhile to use it due to the reduced size of the trees it generates. Note that even though Clus-DTI may generate several decision trees for the same dataset, only one tree is used to classify each test instance, and hence only one tree needs to be interpreted at a time. We can think of the clustering step as a hidden (or latent) attribute that divides the data into subtrees, which are interpreted independently from each other. It is up to the final user to decide whether he or she is willing to spend some extra computational resources in order to analyze (potentially) more comprehensible trees.