DConfidence: an active learning strategy to reduce label disclosure complexity in the presence of imbalanced class distributions
Journal of the Brazilian Computer Society, volume 18, pages 311–330 (2012)
Abstract
In some classification tasks, such as those related to the automatic building and maintenance of text corpora, it is expensive to obtain labeled instances to train a classifier. In such circumstances it is common to have massive corpora where a few instances are labeled (typically a minority) while others are not. Semi-supervised learning techniques try to leverage the intrinsic information in unlabeled instances to improve classification models. However, these techniques assume that the labeled instances cover all the classes to learn, which might not be the case. Moreover, in the presence of an imbalanced class distribution, getting labeled instances from minority classes might be very costly, requiring extensive labeling, if queries are randomly selected. Active learning allows asking an oracle to label new instances, selected by criteria that aim to reduce the labeling effort. DConfidence is an active learning approach that is effective in the presence of imbalanced training sets. In this paper we evaluate the performance of dConfidence in comparison to its baseline criteria over tabular and text datasets. We provide empirical evidence that dConfidence reduces label disclosure complexity, which we define as the number of queries required to identify instances from all the classes to learn, in the presence of imbalanced data.
Introduction
Classification tasks require a number of previously labeled instances. A major bottleneck is that instance labeling is a laborious task requiring significant human effort. This effort is particularly high in the case of text corpora and other unstructured data.
The effort required to retrieve representative labeled instances to learn a classification model is not only related to the number of distinct classes [2]. It is also related to class distribution in the available pool of instances. On a highly imbalanced class distribution, it is particularly demanding to identify instances from minority classes. These, however, may be important in terms of representativeness. Minority classes may correspond to specific information needs which are relevant for specific groups of users. In many situations, such as fraud detection, clinical diagnosis, news [35] and Web resource categorization [17], we face the problem of imbalanced class distributions.
The work described in this paper supports a broader goal: identifying representative instances for each class in the absence of previous descriptions of some or all the classes, in order to obtain a classification model that fully recognizes the target concept, including all the classes to learn, no matter how frequent or rare they are. Furthermore, this must be achieved with a reduced number of labeled instances in order to reduce the labeling effort.
The aim of our current work is to evaluate the performance of our proposal, a new active learning strategy, with respect to its ability to find representative instances of the classes to learn regardless of their distribution in the working set.
There are several learning schemes available for classification. The supervised setting allows users to specify arbitrary concepts. However, it requires a fully labeled training set, which is prohibitive when the labeling cost is high; besides that, it requires labeled instances from all classes. Semi-supervised learning [11] allows users to state specific needs without requiring extensive labeling [17], but still requires that labeled instances fully cover the target concept. Unsupervised learning does not require any labeling, but users have no chance to tailor clusters to their specific needs; therefore, there is no guarantee that the induced clusters are aligned with the classes to learn. In active learning, which seems more adequate to our goals, the learner is allowed to ask an oracle (typically a human) to label instances; these requests are called queries. The most informative queries are selected by the learning algorithm instead of being randomly selected as in passive supervised learning.
In this paper we describe and evaluate the performance of dConfidence [19]. DConfidence is an active learning approach that tends to explore unseen regions in instance space, thus selecting instances from unseen classes faster (with fewer queries) than traditional active learning approaches. DConfidence selects queries based on a criterion that aggregates the posterior classifier confidence and the distance between queries and known classes. Confidence [4] and distance, farthest-first [23], are traditional active learning criteria. DConfidence is biased towards instances that do not belong to known classes (low confidence) and that are located in unseen areas of instance space (high distance to known classes).
A workshop paper from 2008 [18] presents some preliminary results on the performance of dConfidence. These results are based mainly on artificial datasets, with the purpose of assessing the ability of dConfidence to identify rare instances early.
These preliminary results were extended in [19]. That paper describes a systematic approach to the evaluation of dConfidence. It is based on artificial data and focused on comparing the performance of dConfidence to that of confidence with respect to the coverage of the instance space. Two-dimensional artificial datasets were generated to exhibit a set of properties describing global dataset characteristics: cluster alignment, label distribution, cluster morphism and cluster separability. All these properties were defined as binary. Sixteen artificial datasets were generated, covering all the combinations of these four binary meta-descriptors and expected to simulate a wide range of real dataset structures arising in classification tasks. The empirical results showed that dConfidence selects queries from remote regions, where the density of known (labeled) instances is sparse, more efficiently than confidence. Instance space is covered more efficiently when using dConfidence, thus creating conditions to identify representative cases from unknown classes earlier. On average, 100 % coverage of the instance space is achieved by dConfidence with a fraction of the effort required by confidence. Regarding the global properties of the datasets, dConfidence performed clearly better than confidence on “well behaved” datasets (balanced, collinear, isomorphic and separable). On not so well behaved datasets, dConfidence also performs better than confidence, but not as clearly, especially with respect to the classification error.
DConfidence, using SVM as the base classifier, was evaluated over text corpora in two workshop papers. In [20] we compare the performance of dConfidence to that of confidence and random sampling, as a ground benchmark. The results from this paper show that dConfidence identifies exemplary instances for all classes faster than confidence. This gain in labeling effort is bigger for minority classes, which are the ones where the benefits are more relevant for our purposes. As a consequence, the classification model generated by dConfidence is able to identify more distinct classes faster. In [21] this work is continued by comparing dConfidence performance on text corpora to its baseline criteria (confidence and farthest-first) with SVM base classifiers.
The current work extends previous results on dConfidence, providing a comprehensive description and evaluation of this active learning strategy. It adds several contributions, including a formal description of dConfidence, a clear definition of its evaluation criteria, a comparative study of dConfidence with respect to different base classifiers, and a systematic evaluation of the dConfidence strategy against its baseline criteria over tabular and textual data, with a main concern in the identification of rare instances in imbalanced data.
Our hypothesis is that dConfidence improves on both of its baseline criteria. On the one hand, it improves the exploitation behavior of confidence, which is required to prevent an excessive decrease in accuracy; on the other hand, it improves the exploratory behavior of farthest-first, which is required to reduce the minimum number of queries needed to identify instances from all the classes to learn.
Experimental outcomes led us to conclude that dConfidence is more effective than confidence and farthest-first alone in achieving a homogeneous coverage of the target classes.
In the rest of this paper we start by reviewing active learning, in Sect. 2. Section 3 describes dConfidence. The evaluation process is presented in Sect. 4 and we state our conclusions and expectations for future work in Sect. 5.
Active learning
Active learning [4, 13, 33, 36] is a particular form of supervised learning where instances to label are selected by the learner through some criteria aimed at reducing the label complexity [22], i.e., the number of label requests that are necessary and sufficient to learn the target concept.
In active learning, the learner is allowed to ask an oracle (typically a human) to label instances—these requests are called queries. The most informative queries, given the goals of the classification task, are selected by the learning algorithm instead of being randomly selected as is the case in passive supervised learning.
The term active learning was originally coined in the education field in 1991, as a corollary of the broad discussion about instructional paradigms that took place in the 1980s. It refers to instructional activities involving students in doing things and thinking about what they are doing [8].
A few years before, the paradigm had already been applied to machine learning [4]. In that work the author sets a formal framework to study several types of query and their value for machine learning tasks. Although some earlier work had been performed by researchers, the term active learning seems to have been explicitly used in machine learning from 1994 on [13]. In that work the authors define active learning as any form of learning where the learner has some control over the input on which it trains.
Active learning approaches [13, 33, 36] reduce label complexity by analyzing unlabeled instances and selecting those that will be most useful once labeled. Queries may be artificially generated [6] (the query construction paradigm) or selected from a pool [12] or a stream of data (the query filtering paradigm). Our current work is developed under the query filtering approach.
The general idea in active learning is to estimate the value of labeling one unlabeled instance. Query-By-Committee [38], for example, uses a set of classifiers to identify the instance with the highest disagreement. Schohn et al. [37] worked on active learning for Support Vector Machines (SVM), selecting queries (instances to be labeled) by their proximity to the dividing hyperplane. Their results are, in some cases, better than if all available data are used for training. Cohn et al. [14] describe an optimal solution for pool-based active learning that selects the instance that, once labeled and added to the training set, produces the minimum expected error. This approach, however, requires a high computational effort. Earlier active learning approaches (providing non-optimal solutions) aim at reducing uncertainty by selecting as queries the unlabeled instances on which the classifier is least confident [29].
Batch-mode active learning, which selects a batch of queries instead of a single one before retraining, is useful when computational time for training is critical. Brinker [9] proposes a selection strategy, tailored for SVM, that combines closeness to the dividing hyperplane, ensuring a reduction in the version space [32] close to one half, with diversity among the selected instances, ensuring that newly added instances provide an additional reduction of the version space. Hoi et al. [24] suggest a batch-mode active learning method relying on the Fisher information matrix to ensure small redundancy among the selected instances. Li et al. [30] compute diversity within the selected instances from their conditional error. Hoi et al. [25] use batch-mode active learning to increase the number of labeled instances and their diversity, improving SVM performance in each iteration.
Dasgupta [15] defines theoretical bounds showing that active learning has exponentially smaller label complexity than supervised learning under some particular and restrictive constraints. Kääriäinen extended this work by relaxing some of those constraints [28]. An important conclusion of this work is that the gains of active learning are much more evident in the initial phase of the learning process, after which they degrade and the speed of learning drops to that of passive learning. Agnostic active learning [5], A^{2}, achieves an exponential improvement over the usual label complexity of supervised learning in the presence of arbitrary forms of noise. This model is studied by Hanneke [22], who sets general bounds on label complexity.
All these approaches assume that we have an initial labeled set covering all the classes of interest. However, this assumption does not necessarily hold. In fact, collecting and annotating cases is a critical stage in classification tasks (being one of the first stages, it might limit the performance of the following ones) and a demanding one, since it requires domain specialists to retrieve and label exemplary instances for all target classes [30]. The effort in finding these exemplary instances depends not only on the number of target classes [2] but also on their distribution in the working set. In a highly imbalanced class distribution, it is particularly demanding to identify examples from minority classes. These, however, may be important in terms of representativeness. This is the case of a document collection on the Web.
Clustering has also been explored to provide an initial structure to data or to suggest valuable queries. Tat et al. [34] incorporate clustering into active learning by learning a classification model from the set of cluster representatives, and then propagating the classification decision to the other instances via a local noise model. The proposed model allows selecting the most representative instances as well as avoiding repeatedly labeling instances in the same cluster. Adami et al. [2] merge clustering and oracle labeling to bootstrap a predefined hierarchy of classes. Although the original clusters provide some structure to the input, this approach still demands a high validation effort, especially when these clusters are not aligned with the class labels. Huang et al. [27] explore Wikipedia as a background knowledge base to create a concept-based representation of a text document, enabling the automatic grouping of documents with similar themes. The semantic relatedness between Wikipedia concepts is used to find constraints for supervised clustering using active learning.
Dasgupta et al. [16] propose a cluster-based method that consistently improves label complexity over supervised learning. Their method detects and exploits clusters that are loosely aligned with class labels. The method has been applied to the detection of rare categories, obtaining significant gains in the number of queries required to discover at least one instance from each class. This latter work is in line with our own efforts to devise a method capable of swiftly identifying instances from unknown classes; preliminary results were also published by us in 2008, in a workshop paper [18]. Hu et al. [26] propose an active learning scheme, based on graph-theoretic clustering algorithms, to overcome the inability of common active learning approaches to select new instances belonging to classes that have not yet appeared in the working set, as well as their lack of adaptability to changes in the semantic interpretation of sample classes.
An important issue in active learning is the establishment of a compromise between exploration—finding representative instances in the dataset that are useful to label, focusing on completeness—and exploitation—sharpening the classification boundaries, focusing on accuracy.
As described, common active learning methods select the queries which are closest to the decision boundary of the current classifier. They focus on improving the decision functions for previously labeled classes, i.e., they focus on exploitation. The work presented in this paper diverts classifier attention to other regions increasing the chances of finding new labels. DConfidence adds an exploration bias to active learning.
DConfidence active learning
Given a target concept with an arbitrary number of classes together with a sample of unlabeled examples from the target space (the working set), our purpose is to identify representative instances covering all classes while posing as few queries as possible, where a query consists of requesting a label to a specific instance. The working set is assumed to be representative of the class space—the representativeness assumption [31].
Active learners commonly search for queries in the neighborhood of the decision boundary (Fig. 1a), where class uncertainty is higher. The (perceived) uncertainty region is defined [13] as the area that is not determined by the available information, i.e., the set of instances in the working set for which there are two hypotheses that are consistent with all training instances yet disagree on the classification of those instances. However, the perceived uncertainty region might map poorly onto the real target concept, given the current evidence.
Limiting instance selection to the perceived uncertainty region seems adequate when we have at least one labeled instance from each class in which case the perceived uncertainty region is probably consistent with the target concept. This class representativeness is assumed by the majority of active learning approaches. In such a scenario, selecting queries from the uncertainty region is very effective in reducing version space.
But what if the real uncertainty region is not correctly or fully perceived by the current hypothesis? In such circumstances, favoring exploitation rather than exploration reduces the chances of achieving an early complete coverage of the target concept.
The intuition
Our main concern is related to the initial phase of the learning process—data collection and annotation—when we are still looking for exemplary instances to characterize the concept to learn. Under these circumstances and while we do not have labeled instances covering all classes, the uncertainty region perceived by the active learner (Fig. 1a) is reduced to a portion of the real uncertainty region (Fig. 1b). Being limited to this partial view of the concept, the learner is more likely to waste queries. The amount of the uncertainty region that the learner misses is related to the number of classes in the concept to learn that have not yet been identified.
Our intuition (Fig. 2) is that query selection should be based not only on classifier confidence but also on distance to previously labeled instances. In the presence of two instances with equally low confidence—say, X_{ a } and X_{ b } in Fig. 2—we prefer to select the one that is farther apart from what we already know, i.e., from previously labeled instances; referring to Fig. 2, we would prefer to query X_{ a } rather than X_{ b }. This bias improves the exploratory behavior of the active learning approach.
DConfidence
The most common active learning approaches rely on classifier confidence to select queries [4] and assume that the pre-labeled set covers all the labels to learn. These approaches focus on accuracy, favoring exploitation over exploration. Our scenario is somewhat different: we do not assume that we have pre-labeled instances from all classes and, besides accuracy, we are mainly concerned with the fast identification of representative instances from all classes.
To achieve our goals we propose a new selection criterion, dConfidence, which deals well with underrepresented classes. Instead of relying exclusively on classifier confidence, we propose to select queries based on the ratio between classifier confidence and the distance to known classes. DConfidence weighs the confidence of the classifier by the inverse of the distance between the instance at hand and the previously known classes.
DConfidence is expected to favor a faster coverage of the instance space, exhibiting a tendency to explore unknown regions. As a consequence, it provides a better exploratory behavior than confidence alone. This drift towards unexplored regions and unknown classes is achieved by selecting the instance with the lowest dConfidence as the next query. The lowest dConfidence combines low confidence (probably indicating instances from unknown classes) with a high distance to known classes (pointing to unseen regions in instance space). This effect produces significant differences in the behavior of the learning process. Common active learners focus on the uncertainty region, asking queries that are expected to narrow it down. The issue is that the portion of the uncertainty region that is perceived at a given moment is determined by the labels known at that moment. Focusing our search for queries exclusively on this region, while we are still looking for exemplary instances of some labels that are not yet known, is not effective. Instances from unknown classes are hardly found unless those classes are represented in the current uncertainty region.
Algorithm 1 presents dConfidence, an active learning proposal specially tailored to achieve a fast class representative coverage.
W is the working set, a representative sample of instances from the problem space. L_{ i } is a subset of W. Members of L_{ i } are the instances in W whose labels are known at iteration i. C_{ i } is the set of the class labels that have representative instances in L_{ i }. U, a subset of W, is the set of unlabeled instances present in the working set. At iteration i, U_{ i } is the (set) difference between W and L_{ i }; h_{ i } represents the classifier learned at iteration i; q_{ i } is the query selected at iteration i; conf_{ i }(u_{ j },c_{ k }) is the posterior confidence on class c_{ k } given instance u_{ j }, at iteration i.
The core of our proposal is the computation of dConfidence values for the unlabeled instances; this is accomplished in the outer for cycle of Algorithm 1, as explained next. At step (11) we select the next query as the instance with the minimum dConfidence. This query is then added to the labeled set (12) and the whole process iterates until a given stopping criterion is met. In the current implementation, the learning process stops when the unlabeled pool is exhausted.
Computing dConfidence
DConfidence is obtained as the ratio between the classifier confidence and the distance between unlabeled instances and known classes (1). We may view dConfidence as confidence per unit of distance.
For a given unlabeled instance, u_{ j }, the classifier generates the posterior confidence w.r.t. known classes (7). The distance between unlabeled instance u_{ j } and all labeled instances in class c_{ k }, dist( ), is computed by ClassDist( ) at step (8). ClassDist( ) is an indicator of the distance between one instance and one group of instances (those belonging to a given class). The Euclidean metric was previously used in step (2) to compute the distance between all pairs of instances in W. This distance indicator, dist( ), is the median of the distances between instance u_{ j } and all instances in class c_{ k }. We expect the median to soften the effect of outliers. At step (9) we compute dconf_{ i }(u_{ j },c_{ k })—the marginal dConfidence for each known class, c_{ k }, given the instance u_{ j }—by dividing class confidence for a given instance by the aggregated distance to that class.
The maximum dConfidence on individual classes for a given instance u_{ j } is finally computed, at step (10), as the dConfidence of the instance, dConf_{ i }(u_{ j }).
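As an illustration, the computation above together with the selection step of Algorithm 1 can be sketched as follows. This is a minimal sketch under stated assumptions, not the paper's implementation: the helper names are ours, and `predict_proba` is assumed to follow the scikit-learn convention of ordering its columns by sorted class label.

```python
import numpy as np

def class_dist(u, X_class):
    """ClassDist: median Euclidean distance between instance u and the
    labeled instances of one class (the median softens outliers)."""
    return float(np.median(np.linalg.norm(X_class - u, axis=1)))

def d_conf(u, posteriors, labeled_X, labeled_y):
    """dConf_i(u_j): maximum over known classes c_k of
    conf_i(u_j, c_k) / dist_i(u_j, c_k), mirroring steps (7)-(10)."""
    labeled_y = np.asarray(labeled_y)
    return max(
        posteriors[k] / class_dist(u, labeled_X[labeled_y == c])
        for k, c in enumerate(sorted(set(labeled_y)))
    )

def dconfidence_loop(model, X, oracle, init_idx, n_queries):
    """Sketch of Algorithm 1: repeatedly retrain, score the unlabeled
    pool, and query the instance with minimum dConfidence (steps (11)-(12))."""
    labeled = list(init_idx)
    for _ in range(n_queries):
        unlabeled = [i for i in range(len(X)) if i not in labeled]
        if not unlabeled:
            break
        y = [oracle(i) for i in labeled]          # ask labels for L_i
        model.fit(X[labeled], y)
        post = model.predict_proba(X[unlabeled])  # posterior confidences
        scores = [d_conf(X[i], post[j], X[labeled], y)
                  for j, i in enumerate(unlabeled)]
        labeled.append(unlabeled[int(np.argmin(scores))])
    return labeled
```

With a classifier that outputs uniform confidences, the criterion degenerates towards farthest-first behavior, which illustrates how the distance term drives exploration when confidence is uninformative.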
Baseline criteria
DConfidence aggregates two baseline criteria: confidence and distance (based on farthest-first). Confidence, generated at each iteration by the current version of the base classifier in use, is the posterior probability of class c_{ k } given u_{ j }. The aggregated distance to known classes, dist_{ i }(u_{ j },c_{ k }), is computed by ClassDist(u_{ j },c_{ k }) based on the individual distances between each pair of instances (2). Individual pair distances might be computed by any distance function; in the current implementation we use the Euclidean distance. ClassDist(u_{ j },c_{ k }) may also be any aggregation function computed on the individual pair distances between one unlabeled instance u_{ j }∈U_{ i } and every labeled instance from class c_{ k }∈C_{ i } known at iteration i; in the current implementation we use the median.
\(C_{i}^{k}\) is the set of labeled instances known at iteration i that belong to class c_{ k }, i.e., \(C_{i}^{k}=\{\langle x,y\rangle \in L_{i}:y=c_{k}\}\).
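For contrast with the combined criterion, a pure farthest-first selection over a pool could look like the following hypothetical helper (the median aggregation here mirrors the one used for dConfidence; any other aggregation would fit the same skeleton):

```python
import numpy as np

def farthest_first_query(X_unlabeled, X_labeled):
    """Pick the unlabeled instance whose aggregated (median) distance to
    the labeled instances is largest: pure exploration, no classifier."""
    pair_dists = np.linalg.norm(
        X_unlabeled[:, None, :] - X_labeled[None, :, :], axis=2)
    return int(np.argmax(np.median(pair_dists, axis=1)))
```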
Effect of dConfidence on SVM
The output of SVM classifiers is the signed distance to the decision boundary, measured in terms of half the margin width: a case located on the decision boundary outputs 0, while an instance that is collinear with the support vectors for class +1 outputs 1 and an instance that is collinear with the support vectors for class −1 outputs −1. An instance whose distance to the decision boundary is n times the distance between the boundary and a support vector outputs n. This distance, d, is transformed into p∈[0,1], representing the posterior confidence of the learner on class +1.
If, as is commonly the case, this transformation is based on logistic regression (3), the SVM classifier will be very confident on instances that lie far from the decision boundary (Fig. 3a), reducing the chances to select queries far from the current uncertainty region.
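Assuming the plain logistic form for transformation (3) (actual implementations often fit a slope and intercept to the margin distances, as in Platt scaling), the mapping is simply:

```python
import math

def margin_to_confidence(d):
    """Map the signed margin distance d to p in [0,1] for class +1.
    An instance on the boundary (d = 0) gets p = 0.5; confidence
    saturates quickly as |d| grows, the behavior shown in Fig. 3a."""
    return 1.0 / (1.0 + math.exp(-d))
```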
To prevent this behavior, directing the learner not only to low-confidence instances but also to unexplored regions of instance space, the dConfidence value of a point is high in the neighborhood of known instances and decreases with the distance to them (Fig. 3b).
Evaluating dConfidence performance
The ultimate goal of our evaluation of dConfidence is to assess its ability to identify instances from unseen classes while querying for fewer labels without degrading accuracy when compared to its baseline criteria—confidence and farthestfirst. We have designed our evaluation plan with several objectives in mind:

first of all, we want to (a) compare the performance of dConfidence against its baseline criteria;

then we want to (b) assess the impact of the base classifier on the performance of dConfidence;

finally, we want to (c) determine whether the performance of dConfidence depends on the dimensionality of the input feature space. In particular, we want to determine whether dConfidence is appropriate for high-dimensional unstructured datasets, mainly text.
The evaluation was performed over several base classifiers, several datasets and several query selection criteria, including dConfidence and its baseline criteria.
These objectives will be assessed from several performance indicators:

error and known classes (see Definition 1), evaluated at each iteration throughout the learning cycle and

first-hit (see Definition 2) and label disclosure complexity (see Definition 3), evaluated once for every combination of dataset, base classifier and query selection criterion.
Performance indicators
Our evaluation will be based on the performance indicators referred to above: error, known classes, first-hit and label disclosure complexity.
To make these performance indicators clear, let us assume a generic classification task. C is the set of class labels to learn. C_{ i }⊆C is the set of class labels contained in a training set L_{ i }.
Active learning is an iterative process requiring some prior initialization. C_{1} is the set of labels that are represented in L_{1}, the initialization training set. At each iteration, new labeled instances, called queries, are added to the training set.
Error is a common assessment criterion for classification tasks. We have computed the progress of the generalization error (the error on the test set) over all iterations, as new labeled instances are added to the training set.
Known classes is the number of classes that have representative labeled instances in the training set at a given iteration.
Definition 1
Known classes, kc_{ i }, is the cardinality of C_{ i }, i.e., the number of classes represented in the training set at iteration i.
First-hit is defined for each class. It is the number of queries required to identify the first instance of the class for a given dataset, base classifier and query selection criterion.
Definition 2
For each c_{ k }∈C, c_{ k }∉C_{1}, first-hit, fh_{ k }, is the number of queries required to identify the first instance of class c_{ k }. The initialization queries, the instances in L_{1}, are not accounted for.
Label disclosure complexity (LDC) aims to evaluate the ability of the learning process to reveal all the classes belonging to the concept to learn. LDC is inspired by label complexity [22], defined under the active learning setting as the number of queries that are sufficient and necessary to learn the target concept. LDC is the minimum number of queries required to identify at least one instance from every class to learn. LDC equals the maximum first-hit computed over all the classes for a given combination of dataset, base classifier and query selection criterion.
Definition 3
Label disclosure complexity (LDC) is the minimum number of queries that are required to identify at least one instance from every c_{ k }∈C. LDC is equal to max_{ k }(fh_{ k }).
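Given the sequence of labels revealed by the oracle, Definitions 2 and 3 can be computed as in the following sketch (the helper names are illustrative, not the paper's code):

```python
def first_hits(query_labels, init_classes):
    """fh_k for each class first revealed by a query (1-based query index);
    classes already present in the initialization set L_1 are excluded."""
    fh, seen = {}, set(init_classes)
    for i, c in enumerate(query_labels, start=1):
        if c not in seen:
            fh[c] = i
            seen.add(c)
    return fh

def label_disclosure_complexity(query_labels, init_classes):
    """LDC = max_k fh_k: the number of queries needed to disclose every
    class, assuming each class to learn is revealed by some query."""
    return max(first_hits(query_labels, init_classes).values())
```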
Experimental setting
The evaluation plan includes two phases, A and B.
Phase A covers objectives (a) and (b) set above (Sect. 4). The experiments in this phase were performed over tabular data. We have used five datasets from the UCI repository [1]:

Iris (one class is separable while the other two are not),

Cleveland heart disease (imbalanced class distribution),

a random sample from Vowels (higher number of distinct classes than the others),

a sample from Satlog (higher number of attributes than the others) and

a sample from Poker (highly imbalanced class distribution).
These datasets were selected for their properties, mainly due to their distinct class distributions (Table 1).
The purpose of phase A is to assess dConfidence on regular data, avoiding the extra confounding factors that might arise when using unstructured data. As base classifiers, we have used a neural network (NNET), a decision tree (RPART) and Support Vector Machine classifiers with linear kernels (SVM).
Phase B covers objective (c) set above (Sect. 4). For phase B we have selected two high-dimensional unstructured datasets. Two samples from traditional text corpora were used:

a stratified sample from the 20 Newsgroups corpus (NG), containing 500 documents described by 10333 terms, and

a stratified sample from the R52 set of the Reuters21578 collection (R52), containing 1000 documents described by 6019 terms.
The NG dataset has documents from 20 distinct classes while the R52 dataset has documents from 52 distinct classes.
These datasets have been selected for their distinct class distributions. The class distribution in NG is fairly balanced (Fig. 4a), with a maximum frequency of 35 and a minimum frequency of 20, while the R52 dataset presents a highly imbalanced class distribution (Fig. 4b). The most frequent class in R52 has a frequency of 435, while the least frequent has only two instances in the dataset. This dataset has 42 classes, out of 52, with a frequency below 10, of which 31 are below 5.
In phase B we have used SVM classifiers in all experiments. SVMs are commonly regarded as being among the most accurate classifiers for high-dimensional input spaces in general, and text in particular [10].
The query selection criteria under evaluation are dConfidence and its baseline criteria: standard confidence and farthest-first. The performance of these criteria on all datasets was estimated with 10-fold cross-validation. Folds are stratified random samples comprising a partition of the working set. Our aim is to compute the number of queries that are required to identify at least one instance from all classes—from which we can compute known classes, first-hit and LDC—and to compute the generalization error.
The labels in the training set are initially hidden from the classifier and revealed as the learning process iterates. At each iteration, the active learning algorithm asks for the label of a single instance. To initialize each fold we give two pre-labeled instances—from two distinct classes—to the classifier. These are randomly selected from the training set; candidate instances whose class has already been selected are disregarded. Within a given fold, the same initial instances are used for all experiments.
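The evaluation loop just described can be sketched as follows. This is a minimal illustration, not the paper's implementation: `select_query` is a placeholder for any of the selection criteria under evaluation, and the helper names are ours.

```python
import random

def init_labeled(y, rng):
    """Pick two pre-labeled instances from two distinct classes,
    disregarding candidates whose class was already selected."""
    labeled, seen = [], set()
    idx = list(range(len(y)))
    rng.shuffle(idx)
    for i in idx:
        if y[i] not in seen:
            labeled.append(i)
            seen.add(y[i])
        if len(labeled) == 2:
            break
    return labeled

def run_fold(y_train, select_query, rng):
    """Reveal one hidden label per iteration; track known classes."""
    labeled = init_labeled(y_train, rng)
    unlabeled = [i for i in range(len(y_train)) if i not in labeled]
    known = {y_train[i] for i in labeled}
    history = [len(known)]          # known classes after initialization
    while unlabeled:
        q = select_query(unlabeled, labeled)  # next instance to query
        unlabeled.remove(q)
        labeled.append(q)
        known.add(y_train[q])       # the oracle reveals y_train[q]
        history.append(len(known))
    return history
```

With a trivial "query the first unlabeled instance" strategy, the fold starts knowing two classes and eventually covers all of them.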
The Poker dataset has a highly imbalanced class distribution, which causes some exceptions. The two classes with frequency 1 in the Poker dataset are never selected as initial classes. Two out of the 10 folds used for cross-validation do not include all the 10 classes of the Poker dataset. For this reason, the maximum number of classes found when using this dataset is below the total number of classes in the dataset, since it is estimated as a mean over all folds.
In all the experiments, in both phases, we have compared our dConfidence proposal against its baseline selection criteria: the common confidence active learning setting—where query selection is solely based on the low posterior confidence of the current classifier—and farthest-first—where query selection is based only on the distance from training instances, which is independent of the base classifier. Comparing these criteria against each other provides evidence on the performance gains, or losses, of dConfidence when compared to its baselines: confidence and distance (farthest-first).
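For reference, the three criteria can be contrasted as scoring functions over unlabeled instances (higher score means queried first). This is a sketch, not the paper's implementation: the confidence and farthest-first scores follow the standard formulations described above, while the exact way dConfidence combines confidence and distance is our assumption here (per-class posterior confidence weighted down by the distance to the nearest labeled instance of that class, aggregated by a maximum, with the lowest value queried first).

```python
import math

def euclid(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def confidence_score(posteriors):
    """Standard confidence queries the instance the classifier is
    least sure about: lower max posterior -> higher priority."""
    return -max(posteriors.values())

def farthest_first_score(x, labeled_X):
    """Farthest-first queries the instance farthest from the current
    training set, independently of the base classifier."""
    return min(euclid(x, z) for z in labeled_X)

def dconfidence_score(x, posteriors, labeled_by_class):
    """Sketch of dConfidence: each class posterior is weighted down by
    the distance to known instances of that class; the instance with
    the LOWEST aggregated value is queried, hence the negation."""
    agg = max(
        p / min(euclid(x, z) for z in labeled_by_class[c])
        for c, p in posteriors.items() if c in labeled_by_class
    )
    return -agg
```

Under this formulation, an instance that is both uncertain and far from every known class outranks a confident instance close to a labeled one.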
We have performed significance t-tests for the differences of the means observed when using farthest-first, confidence and dConfidence. Statistically different means (significance level of 5 %) are presented in bold face.
In some cases we are using samples extracted from the whole dataset with fewer instances than those available. There is no loss of generality arising from this fact since the learning process converges, with respect to the indicators being measured, before those samples are exhausted.
Empirical results from phase A
In the first experimental phase we want to assess the ability of dConfidence to reduce LDC relative to its baseline criteria. In parallel, we also evaluate accuracy. This assessment was performed over a set of base classifiers to evaluate their effect on the performance of dConfidence.
In every experiment the training set starts with two pre-labeled instances. At each iteration a new instance is queried for its label and added to the training set.
We have recorded the number of distinct labels identified and the error on the test set at each iteration, for every combination of dataset, base classifier and query selection criterion. From these, we have then computed the mean number of known classes and the mean generalization error at each iteration over all cross-validation folds.
The evolution of the error rate and the number of known classes for each dataset, when using SVM as a base classifier, is shown in Figs. 5a–5e, with curves for each selection criterion under evaluation.^{Footnote 1} For convenience of representation, the mean number of known classes has been normalized by the total number of classes in the dataset, thus being transformed into the percentage of known classes instead of the absolute number. This way the number of known classes and the generalization error are both bounded in the same range (between 0 and 1) and we can conveniently represent them on the same chart. Means at each iteration are micro-averages—all instances are equally weighted—over all cross-validation folds for a given combination of dataset, classifier and selection criterion.
The evolution of these indicators—generalization error and mean number of known classes—throughout the learning cycle can be summed up to provide evidence on overall performance. Means in Table 2 are micro-averages over all iterations for a given combination of dataset, classifier and query selection criterion, providing a perspective of the average performance of the query strategy throughout the learning cycle.
Besides the overall error and number of known classes we have also observed first-hit (Table 3). When computing first-hit for a given class we have omitted the experiments where the labeled set for the first iteration contains that class, following Definition 2.
From first-hit we compute LDC for each scenario (Table 4). LDC is the maximum first-hit for a given scenario. It provides the number of queries that are required by the active learning strategy to identify at least one instance from each class to learn, i.e., to achieve full coverage of the target concept.
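Given the sequence of labels revealed by a strategy in one fold, first-hit and LDC can be computed directly. A minimal sketch (the omission of the initial pre-labeled classes, per Definition 2, is left out):

```python
def first_hits(revealed_labels, all_classes):
    """Query index (1-based) at which each class is first seen."""
    hits = {}
    for i, c in enumerate(revealed_labels, start=1):
        if c not in hits:
            hits[c] = i
        if len(hits) == len(all_classes):
            break
    return hits

def ldc(revealed_labels, all_classes):
    """Label disclosure complexity: the maximum first-hit, i.e. the
    number of queries needed to see at least one instance per class.
    Returns None if the sequence never covers all classes."""
    hits = first_hits(revealed_labels, all_classes)
    return max(hits.values()) if len(hits) == len(all_classes) else None
```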
Analysis of results from phase A
In phase A we evaluate the performance of dConfidence over tabular data w.r.t. representativeness, accuracy and first-hit. The influence of the base classifier on the learning strategy is also evaluated.
If we focus on SVM, which will be our base classifier for text corpora, we can observe in Table 2 that dConfidence performs better than confidence and farthest-first, in both labeling effort and accuracy, over tabular datasets. The only exception occurs on the Poker dataset, where the mean error over the whole learning process is lower when using confidence.
The dominance of dConfidence throughout the learning process is also observable in Fig. 5. This dominance is clear, both in terms of error and known classes, on Iris, Vowels and Satlog (Figs. 5a, 5c and 5d). Iris and Vowels have uniform class distributions while Satlog has a fairly balanced class distribution with a coefficient of variation equal to 42 %—the coefficient of variation is the ratio of the standard deviation to the mean. The same performance is also evident on the Cleveland dataset (Fig. 5b). Here, however, while the gain of dConfidence over confidence is clear, it is not as salient over farthest-first. The Cleveland dataset has one majority class with a frequency over 50 % and one underrepresented class with a frequency below 5 %; its coefficient of variation is 98 %. On the highly imbalanced Poker dataset (Table 1), dConfidence takes a clear advantage over confidence w.r.t. known classes throughout the learning process (Fig. 5e). We can also observe that dConfidence is outperformed by farthest-first w.r.t. known classes in the initial quarter of the learning process—up to iteration 106—but overcomes it from there on. On this dataset, however, the error of dConfidence is clearly dominated by that of confidence at the initial stage of the learning process.
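The coefficients of variation quoted above are computed from the class frequencies as defined in the text (ratio of the standard deviation to the mean). Whether the paper uses the sample or population standard deviation is not stated; the sketch below assumes the population form.

```python
import statistics

def coeff_of_variation(class_freqs):
    """Ratio of the (population) standard deviation to the mean of the
    class frequencies, as a percentage; 0 % for a uniform distribution."""
    return 100 * statistics.pstdev(class_freqs) / statistics.mean(class_freqs)
```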
The differences in mean error are statistically significant on the Iris, Satlog and Vowels datasets, in favor of dConfidence. On the other datasets—Cleveland and Poker—the difference is not statistically significant. The most relevant evidence is probably the fact that the error does not degrade; in fact, it generally improves when using dConfidence, compared to confidence and farthest-first, with SVM base classifiers.
If we now move to the other classifiers—NNET (neural network) and RPART (decision tree)—over tabular data, we can observe a similar dominance. DConfidence achieves higher or equal means of known classes in all combinations except when using NNET and RPART over the Vowels dataset and RPART over Poker (Table 2). When it comes to the mean error rate, dConfidence does not perform as well as when relying on an SVM base classifier. DConfidence presents a lower mean error on the Iris dataset, when using neural networks or decision trees, and also on Cleveland and Poker when using NNET. In the other combinations, the error observed when using dConfidence as a query selection strategy is outperformed by the other strategies, although with no statistical significance.
DConfidence also outperforms confidence in first-hit performance, in general. The same does not hold when comparing dConfidence and farthest-first w.r.t. first-hit, where there is no clear evidence of a best performer.
If we sum the number of classes over all datasets, we find a total of 35 classes over the five tabular datasets (three from Iris, five from Cleveland, 10 from Poker, six from Satlog and 11 from Vowels). These datasets have been submitted to three distinct classifiers (SVM, NNET and RPART). In total, over all the experiments, we have evaluated 105 classes.
We can observe that confidence first-hits classes before dConfidence in only 33 of these 105 cases (Table 3). Of these 33 cases, eight happen when using SVM as a base classifier, 12 when using NNET and 13 when using RPART. It is worth noting that 17 of these 33 cases occur on the Vowels dataset. The Vowels dataset has a uniform class distribution (30 instances per class). The added value of dConfidence is more evident on imbalanced datasets.
The Poker dataset—where two out of ten classes occur in only a single instance, corresponding to a relative frequency of 0.2 %, and six classes have a frequency below 1 %—allows evaluating the early identification of underrepresented classes. The average first-hit computed from Table 3 over underrepresented classes—classes 5 to 10—shows that confidence is not appropriate to find rare instances (Table 5).
DConfidence outperforms both its baseline criteria w.r.t. the early identification of instances from underrepresented classes when using SVM and NNET as base classifiers. Farthest-first, however, takes the lead when using decision trees (RPART).
LDC provides further evidence supporting the improved performance of dConfidence over its baseline criteria. In fact, dConfidence has the lowest LDC in all combinations of dataset and classifier evaluated on tabular data, except on the Vowels dataset when using RPART as a base classifier (Table 4). The average gain in dConfidence LDC over all dataset/classifier pairs, when compared to confidence, on tabular data is 542 %, meaning that confidence requires over six times more queries than dConfidence to identify all class labels. This figure, however, is highly biased by the outlier observed on Iris/NNET. Nevertheless, if we remove this outlier from our data we still have a gain of 101 % in LDC, meaning that, on average, confidence requires twice as many queries as dConfidence to achieve a full coverage of the classes to learn on all tabular datasets.
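The gain figures can be read as relative reductions in queries. The exact formula used in the paper is not shown in this excerpt; the definition below is an assumption, but it is consistent with the quoted numbers (a 101 % gain means the baseline needs roughly twice as many queries, and a 542 % gain means over six times as many).

```python
def ldc_gain(ldc_baseline, ldc_dconfidence):
    """Percentage gain of dConfidence over a baseline: how many more
    queries the baseline needs, relative to dConfidence."""
    return 100 * (ldc_baseline - ldc_dconfidence) / ldc_dconfidence
```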
Performance under different levels of class imbalance
With the purpose of reinforcing the previous evidence supporting the ability of dConfidence in the presence of imbalanced data, we have evaluated the active learning strategies under study at different levels of class imbalance. We have based this evaluation on the two datasets exhibiting a uniform distribution—Iris and Vowels—and on SVM base classifiers. The original training datasets were manipulated to ensure imbalanced class distributions. We have randomly sampled from each training fold a set of instances from given classes to be removed from the training dataset, thus obtaining biased distributions with minority classes. Then we have repeated the learning process as before on these training data and collected the results described below.
From each dataset we have extracted four samples according to the process described above. In Iris, the number of instances from one of the classes—which becomes the minority class—was reduced in those samples to 1, 3, 5 and 9, corresponding to percentages of 2 %, 6 %, 11 % and 19 % relative to the frequency of each of the two remaining classes, which kept their uniform distribution from the original training dataset.
In Vowels, a dataset with 11 classes, the number of instances from four of them—which become the minority classes—was reduced in those samples to 1, 2, 3 and 6, corresponding to percentages of 3 %, 7 %, 10 % and 21 % relative to the frequency of each of the remaining classes, which kept their uniform distribution from the original training dataset.
The LDC computed from these experiments (Table 6) confirms the ability of dConfidence to retrieve rare instances in comparison to its baseline criteria.
On average, dConfidence presents a lower LDC than its baseline criteria in all settings except in Vowels with 3 % imbalance. We observe the same scenario, with a significant dominance by dConfidence, when analyzing the empirical results on the number of known classes and on error (Table 7). DConfidence outperforms its baseline criteria with statistical significance in all settings except on the Vowels dataset with 21 % imbalance.
Common query selection
Comparing the instances that are selected by each active learning strategy adds relevant information to our discussion. Are all strategies selecting the same instances at the same time throughout the learning cycle? We have investigated this question by measuring the percentage of common selected queries as the learning process iterates on Iris (Fig. 6a) and Vowels (Fig. 6b). Each curve in the charts represents the average, computed over all cross-validation folds at each iteration, of the percentage of common instances observed in the labeled sets used to train the classifier under the referred strategies—dConfidence (dc), confidence (c) or farthest-first (ff).
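The overlap measure plotted in Fig. 6 can be computed, at each iteration, as the percentage of instances common to the labeled sets built by two strategies. A sketch (the averaging over folds is omitted; both sets have the same size at a given iteration):

```python
def overlap_pct(labeled_a, labeled_b):
    """Percentage of instances common to two labeled sets of equal
    size, built by two strategies at the same iteration."""
    common = set(labeled_a) & set(labeled_b)
    return 100 * len(common) / len(set(labeled_a))
```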
It is clear from Fig. 6a that dConfidence and confidence query many common instances during the initial stage of the learning process on the Iris dataset. In fact, after the first 29 queries the labeled sets of both these strategies have nearly 60 % intersection. This level of overlap then stabilizes, only to start increasing later as a consequence of the exhaustion of the unlabeled set, which necessarily increases the intersection between the labeled sets of all strategies.
The opposite behavior is observed when comparing farthest-first with either confidence or dConfidence. Despite the fact that dConfidence and farthest-first share many common instances at the very first iterations (60 %), this overlap drops quickly, getting close to 20 % after 11 queries.
On the Vowels dataset, the overlap between the labeled sets being built by all active learning strategies increases at a constant rate throughout most of the learning process. Only at the very beginning, during the initial 35 iterations, is there some difference in this behavior, with dConfidence and farthest-first querying more common instances than the rest. As also observed on the Iris dataset, confidence and farthest-first are the strategies sharing the fewest queries.
Empirical results from phase B
The evolution of the error rate and the number of known classes over text corpora is shown in Figs. 7a and 7b with curves for each selection strategy under evaluation.
Similarly to what we have done for phase A, the evolution of the error and of the mean number of known classes throughout the learning cycle has also been aggregated to summarize overall performance on text corpora (Table 8).
Besides the overall number of queries required to retrieve labels from all classes and the generalization error, we have also observed first-hit (Tables 9 and 10). When computing first-hit for a given class we have excluded the experiments where the labeled set for the first iteration contains instances from that class.
The learning process for the R52 dataset was halted after 600 iterations, before exploring the full unlabeled pool—the working set had 1000 instances, 900 of which were used for training in each fold. All the class labels to learn were identified after 600 iterations for all the selection criteria, except for farthest-first. The mean number of known classes after 600 iterations equals 52 for confidence and dConfidence, meaning these criteria achieved full coverage of the class labels to learn in all the cross-validation folds. For farthest-first this mean is 50.3, which means that farthest-first cannot identify all class labels in all cross-validation folds within 600 iterations. In several folds, farthest-first missed six classes with a frequency of two, two classes with a frequency of three and one class with a frequency of four. In such cases we have assigned a first-hit value of 601 to the unidentified classes. For instance, in a given fold where farthest-first misses two classes, their first-hit values are assumed to be 601 and 602—the very first queries after halting the learning process at 600 iterations. First-hit means were computed on this assumption. Under such circumstances this is the most favorable assumption for farthest-first.
We have computed LDC—the number of queries that are required to identify at least one instance from each class to learn—from first-hit, according to Definition 3, for each scenario (Table 11).
Analysis of results from phase B
Figure 7a shows that there is no clear dominance, neither from dConfidence nor from farthest-first, when finding unknown classes in the NG dataset. However, both these criteria outperform confidence on this dataset. The difference between the mean first-hit of dConfidence and farthest-first in Table 9—20.45 for farthest-first and 21.57 for dConfidence—is not statistically significant at a 5 % significance level.
The accuracy of dConfidence dominates that of farthest-first (Fig. 7b). The mean accuracy of dConfidence over all iterations is 2 % better than that of farthest-first. This result is significant at the 5 % significance level. The NG dataset has a fairly balanced class distribution. On the R52 dataset, which has a highly imbalanced class distribution, we can observe a very different performance (Fig. 7b).
In R52, farthest-first starts by identifying unknown classes a little faster than dConfidence (Fig. 7a). However, after the initial learning stage, dConfidence outperforms and dominates farthest-first. When identifying unknown classes, farthest-first leads up to the 45th query, on average, taking a maximum advantage of two classes after 37 queries. After 45 queries, with 13.2 classes identified on average, dConfidence clearly dominates farthest-first.
It is interesting to notice that farthest-first beats dConfidence on the majority classes (Table 10) but, when all majority classes are found and only minority classes are left unexposed, dConfidence reveals its ability to find rare instances. The mean frequency of the classes that are first found in R52 by dConfidence is 3.2, while it is 12.5 for confidence and 33.8 for farthest-first.
If we take a step back to analyze dConfidence first-hit against farthest-first on the highly imbalanced Poker dataset, we find an unexpected outcome. In this case, farthest-first generally outperforms dConfidence in finding rare instances, contrary to what happens in text corpora. This is probably a sign that distance might be a better discriminator in low-dimensional input spaces than it is in high-dimensional input spaces.
Distance functions might lose their usefulness in high-dimensional spaces, where the distances to the nearest and farthest neighbors become very similar—the curse of dimensionality [7]. This effect is most noticeable when using L_{k}-norm distances with a high value of k (k≥3). Euclidean distance, an L_{2}-norm metric, is not much affected [3]. To assess this effect on our datasets we have computed the relative contrast—measuring the relative distance of the nearest and farthest neighbors of a given query—for all instances in each dataset (4). In Table 12^{Footnote 2} we can observe that the discrimination between the nearest and farthest neighbors is not too sensitive to the data dimensionality. Despite the fact that the minimum relative contrast exhibits a negative correlation of 64 % with the data dimensionality, the maximum relative contrast is not correlated and there is no evidence that high-dimensional data are affecting the distance metric in use. The lack of correlation between the global contrast measure and data dimensionality supports this conclusion.
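Equation (4) is not reproduced in this excerpt. A common definition of relative contrast, following Aggarwal et al. [3], compares the farthest and nearest neighbor distances of a query point; the sketch below assumes that formulation.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def relative_contrast(query, data):
    """(Dmax - Dmin) / Dmin over the distances from `query` to every
    other instance; values near 0 indicate that nearest and farthest
    neighbors are barely distinguishable (curse of dimensionality)."""
    dists = [euclidean(query, x) for x in data if x is not query]
    d_min, d_max = min(dists), max(dists)
    return (d_max - d_min) / d_min
```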
On the R52 dataset the difference in mean error is significant in favor of confidence. DConfidence reduces the labeling effort that is required to identify instances in R52, exhibiting better representativeness capabilities in this corpus. However, the error rate gets worse. Apparently, dConfidence gets to know more classes from the target concept earlier, although less sharply. In the R52 dataset we are exchanging accuracy for representativeness.
A similar analysis of the LDC for text corpora (Table 11) is not as clear on the improvement of dConfidence. DConfidence outperforms confidence on the R52 corpus, with a lower LDC by 22 %, but confidence outperforms dConfidence on NG, with a lower LDC by 34 %. Nevertheless, it is relevant that dConfidence, once again, performs better on imbalanced data.
Figure 8 provides additional evidence on the ability of dConfidence to find rare instances. These charts, where classes are sorted by increasing frequency, show that dConfidence ensures a significant reduction in the mean number of queries that are required to first hit classes in R52. This reduction is more important for minority classes, i.e., the first classes appearing on the horizontal axis. These charts represent the difference of dConfidence first-hit compared to its baseline criteria. Negative differences mean that dConfidence performed better, i.e., found representative instances of the class with fewer queries than its baseline criteria.
The dashed trend lines represented in both charts (Fig. 8), with a positive slope, clearly show that the gain in dConfidence first-hit, when compared to both its baseline criteria, decreases as the class frequency increases.
Another perspective on these results may clarify our point. In Fig. 9a we give, for each distinct value of class frequency in the working set, the number of classes that were first found by each criterion—lowest first-hit among all criteria. Figure 9b represents the accumulated number of first-found classes. As detailed below, both these charts show evidence of the improved ability of dConfidence to find exemplary instances of underrepresented classes.
When comparing dConfidence against farthest-first we can observe that, of the 17 classes in R52 that have a frequency of 2, dConfidence finds 11 before farthest-first. Of the 12 classes with a frequency of 3, dConfidence finds 10 before farthest-first. Of the 13 classes with a frequency between 4 and 9, dConfidence finds 10 with fewer queries than farthest-first. Of the remaining 10 classes, with a frequency between 11 and 435, dConfidence finds only two before farthest-first.
A similar comparison against confidence shows similar results. Of the 17 classes in R52 that have a frequency of 2, dConfidence finds 13 before confidence. Of the 12 classes with a frequency of 3, dConfidence finds 10 before confidence. Of the 13 classes with a frequency between 4 and 9, dConfidence finds 10 with fewer queries than confidence. Of the remaining 10 classes, with a frequency between 11 and 435, dConfidence finds five before confidence.
Prevailing outcomes
The experimental results from both phases provide evidence on the performance of dConfidence towards our objectives (Sect. 4).
Base classifier
The performance of dConfidence seems to be only slightly affected by the base classifier, mainly w.r.t. error. With respect to known classes, dConfidence generally improves over its baseline criteria. DConfidence is best suited for SVM classifiers, where it generally improves over its baseline criteria. When using other base classifiers the performance is affected, but improvements are still observable.
Confidence vs. dConfidence
If we focus on SVM, we can observe that dConfidence performs better than confidence, in both labeling effort and accuracy, over tabular datasets as well as over text corpora. DConfidence dominates confidence w.r.t. known classes throughout the learning process. DConfidence also outperforms confidence in first-hit performance, in general. This dominance is also evident w.r.t. error, except on highly imbalanced datasets, where confidence takes the lead.
Farthest-first vs. dConfidence
DConfidence clearly dominates farthest-first w.r.t. error when using SVM classifiers. The relative performance of these two criteria when it comes to known classes depends on the class distribution of the working set. On balanced datasets, dConfidence clearly outperforms farthest-first. On imbalanced datasets dConfidence still outperforms farthest-first on average; however, farthest-first generally beats dConfidence in finding majority classes.
Data dimensionality
The dimensionality of the input feature space does not compromise dConfidence, which exhibits performance improvements over its baseline criteria on tabular, low-dimensional data as well as on high-dimensional text corpora. However, some experimental results show that, unexpectedly, farthest-first outperforms dConfidence when finding rare instances in imbalanced low-dimensional data (the Poker dataset), while the same is not observed in high-dimensional data (the R52 corpus). This probably has to do with the better discriminative ability of distance in low-dimensional input spaces compared to high-dimensional ones. This might require a parameter to tune the relative weight of confidence and distance in dConfidence.
Balanced vs. imbalanced class distributions
In general, dConfidence outperforms its baseline criteria in finding exemplary instances from all the target classes. The gain is particularly relevant when finding underrepresented classes in the presence of highly imbalanced data. This gain, however, is achieved at the cost of accuracy. In the presence of imbalanced data, the exploratory bias of dConfidence promotes exchanging accuracy for representativeness.
Conclusions and future work
The evaluation procedure that we have performed provides statistical evidence on the performance of dConfidence when compared to its baseline criteria—confidence and farthest-first. DConfidence reduces the labeling effort and identifies exemplary cases for all classes faster than confidence and farthest-first alone. This gain is higher for minority classes, which are the ones where the benefits of dConfidence become most relevant.
The base classifier used in the learning process has some influence on accuracy but apparently not on the labeling effort. DConfidence consistently presents a lower label disclosure complexity irrespective of the base classifier. When it comes to error, the models generated by SVM classifiers seem to take better advantage of dConfidence than neural networks or decision trees.
DConfidence performs best on imbalanced datasets, where it provides significant gains that greatly reduce the labeling effort. Even so, dConfidence consistently outperforms confidence and farthest-first in terms of label complexity across all datasets.
In general, dConfidence improves the performance of its baseline criteria both from the exploration point of view—finding unknown classes faster—and from the exploitation point of view—improving, although marginally, the accuracy—when applied to tabular, lowdimensional data.
When applied to text corpora, farthest-first was outperformed by dConfidence on the imbalanced corpus and presented a similar performance on the balanced corpus, in terms of finding unknown classes, but with lower accuracy.
In general, dConfidence achieved better performance on the imbalanced corpus than on the balanced one. The main drawback of dConfidence when applied to the imbalanced text corpus is that the reduction in the labeling effort achieved in identifying unknown classes comes at the cost of increased error. This increase in error is probably due to the fact that we are diverting the classifier from focusing on the decision function of the majority classes to focus on finding new, minority, classes. As a consequence, the classification model generated by dConfidence is capable of identifying more distinct classes faster but gets less sharp on each one of them. This is particularly harmful for accuracy since a fuzzier decision boundary for majority classes might cause many erroneous guesses, with a negative impact on error.
We are now exploring semi-supervised learning to leverage the intrinsic value of unlabeled instances, so we can benefit from the reduction in labeling effort provided by dConfidence and improve accuracy.
Comparing the instances that are being selected by each active learning strategy—for instance, by computing the percentage and class distribution of common selected instances as the learning process evolves—might help in understanding the operating patterns of each strategy. Although this line of work is in progress, the preliminary results reveal the distinction between the confidence and farthest-first strategies.
Computing distances between documents may be demanding and impose further limitations on dConfidence. This effort can be reduced by first pre-selecting a subset of documents using a less demanding process and only then choosing the document to label. This is another line of future work.
Another fundamental aspect of active learning that we are focused on is the definition of a stopping criterion, so we can decide when to stop querying.
Notes
 1.
We will use the following notation to refer to results in tables and charts: ff stands for farthest-first, c stands for confidence and dc stands for dConfidence. Generalization error is referred to as e, kc refers to the mean number of known classes and ldc refers to LDC.
 2.
Notation: min.rc and max.rc stand for the minimum and maximum relative contrast observed in each dataset; global.rc is a global contrast measure for each dataset computed by (4) but using the maximum and minimum distances between all the instances in the dataset.
References
 1.
UC Irvine Machine Learning Repository (2009). http://archive.ics.uci.edu/ml/
 2.
Adami G, Avesani P, Sona D (2005) Clustering documents into a web directory for bootstrapping a supervised classification. Data Knowl Eng 54:301–325
 3.
Aggarwal CC, Hinneburg A, Keim DA (2001) On the surprising behavior of distance metrics in high dimensional spaces. In: Proceedings of the 8th international conference on database theory, ICDT’01. Springer, London, pp 420–434. http://dl.acm.org/citation.cfm?id=645504.656414
 4.
Angluin D (1988) Queries and concept learning. Mach Learn 2:319–342. doi:10.1007/BF00116828
 5.
Balcan MF, Beygelzimer A, Langford J (2006) Agnostic active learning. In: ICML, pp 65–72.
 6.
Baum E (1991) Neural net algorithms that learn in polynomial time from examples and queries. IEEE Trans Neural Netw 2:5–19
 7.
Bellman RE (1957) Dynamic programming. Princeton University Press, Princeton
 8.
Bonwell CC, Eison JA (1991) Active learning: creating excitement in the classroom. JosseyBass, San Francisco
 9.
Brinker K (2003) Incorporating diversity in active learning with support vector machines. In: Proceedings of the twentieth international conference on machine learning
 10.
Chakrabarti S (2002) Mining the Web: discovering knowledge from hypertext data. Morgan Kaufmann, San Mateo. http://www.cse.iitb.ac.in/~soumen/miningtheweb/
 11.
Chapelle O, Schoelkopf B, Zien A (eds) (2006) Semi-supervised learning. MIT Press, Cambridge
 12.
Cohn D, Atlas L, Ladner R (1990) Training connectionist networks with queries and selective sampling. In: Advances in neural information processing systems
 13.
Cohn D, Atlas L, Ladner R (1994) Improving generalization with active learning. Mach Learn 15:201–221. doi:10.1023/A:1022673506211. http://portal.acm.org/citation.cfm?id=189256.189489
 14.
Cohn D, Ghahramani Z, Jordan M (1996) Active learning with statistical models. J Artif Intell Res 4:129–145
 15.
Dasgupta S (2005) Coarse sample complexity bounds for active learning. In: Advances in neural information processing systems, vol 18
 16.
Dasgupta S, Hsu D (2008) Hierarchical sampling for active learning. In: Proceedings of the 25th international conference on machine learning
 17.
Escudeiro N, Jorge A (2006) Semi-automatic creation and maintenance of web resources with webTopic. In: Semantics, web and mining. LNCS, vol 4289. Springer, Heidelberg, pp 82–102
 18.
Escudeiro N, Jorge A (2008) Learning partially specified concepts with dConfidence. In: Brazilian symposium on artificial intelligence, web and text intelligence workshop
 19.
Escudeiro N, Jorge A (2009) Efficient coverage of case space with active learning. In: Lopes LS, Lau N (eds) Progress in artificial intelligence, proceedings of the 14th Portuguese conference on artificial intelligence (EPIA 2009), vol 5816. Springer, Berlin, pp 411–422
 20.
Escudeiro N, Jorge AM (2010) DConfidence: an active learning strategy which efficiently identifies small classes. In: Proceedings of the NAACL HLT 2010 workshop on active learning for natural language processing, Association for Computational Linguistics, Los Angeles, CA, pp 18–26
 21.
Escudeiro N, Jorge AM (2010) Reducing label complexity in the presence of imbalanced class distributions. In: Proceedings of the III international workshop on web and text intelligence (WTI—2010), São Bernardo do Campo, São Paulo, Brazil
 22.
Hanneke S (2007) A bound on the label complexity of agnostic active learning. In: Proceedings of the 24th international conference on machine learning
 23.
Hochbaum D, Shmoys D (1985) A best possible heuristic for the k-center problem. Math Oper Res 10(2):180–184
 24.
Hoi S, Jin R, Lyu M (2006) Large-scale text categorization by batch mode active learning. In: Proceedings of the World Wide Web conference
 25.
Hoi SCH, Jin R, Zhu J, Lyu MR (2009) Semi-supervised SVM batch mode active learning with applications to image retrieval. ACM Trans Inf Syst 27(3):1–29. doi:10.1145/1508850.1508854
 26.
Hu W, Hu W, Xie N, Maybank S (2009) Unsupervised active learning based on hierarchical graph-theoretic clustering. Trans Syst Man Cybern, Part B 39(5):1147–1161. doi:10.1109/TSMCB.2009.2013197
 27.
Huang A, Milne D, Frank E, Witten IH (2008) Clustering documents with active learning using Wikipedia. In: ICDM’08: proceedings of the 2008 eighth IEEE international conference on data mining. IEEE Comput. Soc., Washington, pp 839–844. doi:10.1109/ICDM.2008.80
 28.
Kääriäinen M (2006) Active learning in the nonrealizable case. In: Algorithmic learning theory. Springer, Berlin/Heidelberg, pp 63–77
 29.
Lewis DD, Gale WA (1994) A sequential algorithm for training text classifiers. In: SIGIR’94: proceedings of the 17th annual international ACM SIGIR conference on research and development in information retrieval. Springer, New York, pp 3–12
 30.
Li M, Sethi I (2006) Confidence-based active learning. IEEE Trans Pattern Anal Mach Intell 28:1251–1261
 31.
Liu H, Motoda H (2001) Instance selection and construction for data mining. Kluwer Academic, Dordrecht
 32.
Mitchell TM (1997) Machine learning. McGraw-Hill, New York
 33.
Muslea I, Minton S, Knoblock CA (2006) Active learning with multiple views. J Artif Intell Res 27:203–233
 34.
Nguyen HT, Smeulders A (2004) Active learning using pre-clustering. In: Proceedings of the 21st international conference on machine learning. ACM, New York, pp 623–630
 35.
Ribeiro P, Escudeiro N (2008) Online news “à la carte”. In: Proceedings of the European conference on the use of modern information and communication technologies
 36.
Roy N, McCallum A (2001) Toward optimal active learning through sampling estimation of error reduction. In: Proceedings of the eighteenth international conference on machine learning, ICML’01. Morgan Kaufmann, San Francisco, pp 441–448. http://portal.acm.org/citation.cfm?id=645530.655646
 37.
Schohn G, Cohn D (2000) Less is more: active learning with support vector machines. In: Proceedings of the international conference on machine learning
 38.
Seung H, Opper M, Sompolinsky H (1992) Query by committee. In: Proceedings of the 5th annual workshop on computational learning theory
Cite this article
Escudeiro, N.F., Jorge, A.M. DConfidence: an active learning strategy to reduce label disclosure complexity in the presence of imbalanced class distributions. J Braz Comput Soc 18, 311–330 (2012). https://doi.org/10.1007/s13173-012-0069-3
Keywords
 Active learning
 Imbalanced data
 Label disclosure complexity
 Text classification