Partially labeled data stream classification with the semisupervised Kassociated graph
 João Roberto Bertini Jr.^{1},
 Alneu de Andrade Lopes^{1} and
 Liang Zhao^{1}
https://doi.org/10.1007/s1317301200728
© The Brazilian Computer Society 2012
Received: 15 June 2011
Accepted: 22 March 2012
Published: 17 April 2012
Abstract
Regular data classification techniques are based mainly on two strong assumptions: (1) the existence of a reasonably large labeled set of data to be used in training; and (2) future input data instances conform to the distribution of the training set, i.e. the data distribution is stationary over time. However, in the case of data stream classification, both of the aforementioned assumptions are difficult to satisfy. In this paper, we present a graphbased semisupervised approach that extends the static classifier based on the Kassociated Optimal Graph to perform online semisupervised classification tasks. In order to learn from labeled and unlabeled patterns, we adapt the optimal graph construction to simultaneously spread the labels in the training set. The sparse, disconnected nature of the proposed graph structure gives flexibility to cope with nonstationary classification. An experimental comparison between the proposed method and three state-of-the-art ensemble classification methods is provided, and promising results have been obtained.
1 Introduction
Recently, graphbased (also referred to networkbased) algorithms applied to data mining tasks have attracted great attention in both theoretical research and practical applications [5]. This growing interest is mostly justified due to the advantages provided by graph representation, such as revealing topological structure of input data and the ability of identifying arbitrary shapes of data clusters [27]. In such graphbased algorithms, each vertex of the graph represents a data pattern (data instance) and the edges stand for some relation of similarity between vertices. In order to reveal significant relations within a data set, the following rule is usually considered for establishing connections between data patterns: the higher the similarity among data, the higher the probability of connection [39]. Stated in this way, nearby patterns tend to be heavily linked together while distant patterns may form a sparse structure. This property has been extensively explored using graphbased solutions, especially considering unsupervised tasks like clustering [32] and dimensionality reduction [1]. Only recently graphbased classification has been addressed, usually by the wrap of semisupervised learning [38].
Semisupervised learning methods concern the problem of automatic classification considering data sets with a small number of labeled data and a large amount of unlabeled data [7]. Such approach relies on the fact that labeled data are difficult to be gathered and often are associated with high costs, while unlabeled data are abundant in most applications and generally easy to be collected. Moreover, the manually labeling process is not always reliable or practicable. For example, consider obtaining enough labeled data to train a classifier for a spam detection task (i.e. classifying spam and valid email). Such application design (1) incurs cost in paying an expert or a group of users to label what they call spam from what they consider real email; (2) may result in inconsistencies if we accept all human categorization. For instance, an email message may be considered as a spam by some people, but it may be considered as a valid email by others; (3) not to mention the time required to manually label enough data to train a regular supervised learning method.
A spam detection application is really a stream classification problem, in the sense that the classifier needs to classify new patterns at the time they arrive [35]. In this kind of applications, the underlying data distribution changes over time, and such changes often make the model built on old data inconsistent to the newly arrived data. This problem, known as concept drift [34], requires frequent updating of the model. Summarizing, we have a classification problem which consists of a data stream where few instances are labeled and data distribution may change over time. This scenario poses a challenging task for machine learning because it presents too few labeled data along the stream to apply a supervised incremental algorithm and the presence of concept drift disables the use of static classifiers. In fact, only recently such applications have been properly addressed due to the concept of learning through both labeled and unlabeled data and the development of semisupervised learning strategies.
In the development of semisupervised learning algorithms, many efforts have been made on the use of a clustering algorithm to group the patterns and further spread the labels. When considering this approach, the Kmeans algorithm is a natural choice. Li et al. [20] proposed a treebased algorithm which uses the Kmeans to spread labels at the leaves of a tree. Masud et al. [22] proposed an ensemble of microclusters, obtained by using the Kmeans algorithm, then instances are classified according to the Knearest neighbor rule. Ditzler and Polikar [11] proposed an ensemble of classifiers, named WEA, which are trained with labeled patterns only. Then, unlabeled data and the Kmeans algorithm are used to generate a mixture of Gaussian models for further adjusting the weights of each classifier. Zhang et al. [37] use the semisupervised SVM [8] allied to a version of the Kmeans, referred to as relational Kmeans, to construct new features to the labeled examples by using information extracted from unlabeled instances. Some investigations have been made to tackle specific problems, e.g. Erman et al. [12] proposed a method to perform traffic classification in computer networks with partially labeled data. Their method uses a clustering algorithm, such as Kmeans, to obtain the clusters and then, the labels are spread using the maximum likelihood estimation. The clusters that remain unlabeled are likely to be an undefined group. Also regarding computer network, Yu et al. [36] have considered the problem of intrusion detection. They employ a strategy similar to the Kmeans by grouping the labeled data and then, the labels are spread to the whole data set according to the distances from the clusters to unlabeled patterns. Finally, a SVM is trained to detect intrusion.
To the best of our knowledge, graphbased approach has not been considered to tackle streaming classification problems where data are partially labeled; although it is successfully applied to semisupervised learning, especially to the transduction problem [2, 6, 10, 25]. In view of the recent developed graphbased nonparametric classification method and its good performance on stationary data sets [4, 21]; we had proposed a nonstationary version with initial results reported in Ref. [3]. In this paper, we propose an extended version to be applied in the context of nonstationary stream of partially labeled data. The aforementioned graphbased method is based on representing the training set as a special graph, referred to as Kassociated graph. The Kassociated graph is able to represent similarity relations among data instances and the purity of a component (connected subgraph) is able to represent the data topology. Purity characterizes the degree in which instances of different classes are mixed in a same region of the data space. In this work we propose a new constructing procedure for the Kassociated graph that takes into account partially labeled sets. Also, this work shows how the graph is updated along the time to allow data stream processing.
The remainder of the paper is organized as follows. In Sect. 2, we briefly describe the problem of concept drift and present a toy example to illustrate a scenario where incremental learning is applicable. Section 3 presents the proposed method for nonstationary partially labeled stream classification. This section is further divided into four subsections: Sect. 3.1 introduces the Kassociated graphs and the Kassociated optimal graph; the new method for constructing these graphs from partially labeled data sets is described in Sect. 3.2; Sect. 3.3 briefly treats the static KAOG classifier [4]; and Sect. 3.4 details how the graph is updated over time. Section 4 presents the experimental results concerning the performance comparison between the proposed algorithm and three well-known fully supervised streaming ensemble classifiers on nonstationary partially labeled benchmarks. Section 5 concludes the paper and discusses future work.
2 Background

A priori of classes P(ω_{1}),…,P(ω_{ M }), i.e. alteration on the relative size of a given class or the appearance of new classes.

Conditional P(x∣ω_{ i }), i.e., changing on class definition. For example, changes in the shape of a class.

Conditional a posteriori P(ω_{ i }∣x), i.e., modification of some of the attributes.
In Fig. 1, the (blue) rectangles represent the instances that belong to class ω_{1} and the (red) circles represent the instances belonging to class ω_{2}. Consider Figs. 1(a)–(d) as a sequence of data distributions of an application presented over time, starting at t_{0}. The concept drifts that occur between the distributions of Figs. 1(a) and 1(b), as well as between Figs. 1(b) and 1(c), are abrupt. Also notice that the distribution shown in Fig. 1(c) is similar to that in Fig. 1(a), which means that the distribution at time t_{0} in Fig. 1(a) occurs again at time t_{j+1}, after the stream has gone through a completely different distribution (Fig. 1(b)). This phenomenon characterizes a recurrent concept. As the time line shows, from Fig. 1(a) to Fig. 1(d), each distribution can remain static for a given period of time, e.g., the initial distribution remains static from t_{0} to t_{ i }. Nonetheless, at the next iteration t_{i+1}, the distribution can be totally altered, i.e. an abrupt drift occurs. The drift between the distributions in Figs. 1(c) and 1(d) is also considered abrupt, in spite of being less severe than the previous one. Consider now a situation where two groups of data from different classes cross each other over time, depicted in order in Figs. 1(e)–(g), from an initial distribution (Fig. 1(e)) to a final one (Fig. 1(g)), with Fig. 1(f) corresponding to an intermediate distribution. In such a scenario, the distribution varies smoothly over time, which characterizes a gradual drift. Finally, let Fig. 1(h) represent a distribution determined by a hyperplane that rotates over time. If the hyperplane is rotated by π/4 at regular intervals, the drift is characterized as gradual, and it is recurrent at every eight alterations of the hyperplane. However, if the rotation per interval is increased, say to π, the drift can now be considered abrupt.
This demonstrates that it is surprisingly difficult to accurately characterize concept drifts considering only velocity and recurrence. In view of this problem, many researchers have proposed different drift categories; for a recent work, refer to Ref. [23].
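The rotating hyperplane scenario can be made concrete with a small sketch. The following Python fragment is only an illustration under stated assumptions: a two-dimensional hyperplane through the origin described by its normal vector, and hypothetical helper names (`rotate`, `label`, `concept_after`) that do not come from the paper. It shows that a step of π/4 brings the concept back after eight alterations, whereas a step of π flips the labels in a single alteration.

```python
import math

def rotate(normal, angle):
    """Rotate a 2-D normal vector counter-clockwise by `angle` radians."""
    x, y = normal
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y)

def label(point, normal):
    """Class 1 on the non-negative side of the hyperplane, class 2 otherwise."""
    side = point[0] * normal[0] + point[1] * normal[1]
    return 1 if side >= 0 else 2

def concept_after(steps, step_angle, start=(1.0, 0.0)):
    """Normal vector of the hyperplane after `steps` rotations of `step_angle`."""
    normal = start
    for _ in range(steps):
        normal = rotate(normal, step_angle)
    return normal

# Gradual and recurrent: eight rotations of pi/4 restore the initial concept.
recurrent = concept_after(8, math.pi / 4)
# Abrupt: a single rotation of pi inverts the class of every instance.
flipped = concept_after(1, math.pi)
```

With a small step, consecutive concepts label most instances identically, which is what makes the drift gradual; with a step of π, every instance away from the boundary changes class at once.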
In spite of characterizing concept drift, the main concern is that, most of the time, the variation in the underlying data distributions degenerates the performance of the classifier in use. The need for replacing a classifier due to the drop in accuracy, caused or not by a concept drift, is called virtual concept drift [18]. The trivial way to treat virtual drift is to replace the low accuracy classifier by a new one. However, such strategy brings at least three prohibitive drawbacks, (i) retraining new classifiers usually is computationally expensive; (ii) detecting when the current classifier is no longer useful is quite challenging, mainly due to the natural fluctuations in performance that can be confused with real concept drift; (iii) selecting what data should be used to train the new classifier is also a hard task. Fortunately, incremental learning algorithms can be applied to provide practical solutions to tackle classification problems on nonstationary domains. Such an approach enables a classifier to acquire knowledge during application phase, updating the model with new data, and without explicitly retraining itself [14, 30].
Figure 2(b) shows the results of the comparison between the Kassociated static and incremental classifiers. The significant difference between them is due to the fact that the static classifier no longer learns from new instances, whereas the incremental classifier is able to learn during the classification phase. The presented incremental learning process is analogous to the linearization technique widely used to study local properties of nonlinear systems. Specifically, linearization in the neighborhood of a certain point corresponds to subset selection in incremental learning. Nonlinearity of the system corresponds to the twisted shape of the classes and to the change of the data distribution over time. In nonlinear systems, linearization usually yields a good approximation if the neighborhood under analysis is small. For the same reason, we expect that good classification results can be obtained by updating the network with a small data subset each time.
3 The semisupervised Kassociated optimal graph
The semisupervised Kassociated graph, proposed here, consists of a modification of the Kassociated graph [4] to deal with both labeled and unlabeled data during the graph construction procedure. Therefore, in order to introduce the semisupervised version, a brief revision of the Kassociated graph is presented in Sect. 3.1. It is followed by the semisupervised Kassociated graph construction presented in detail in Sect. 3.2. Both supervised and semisupervised Kassociated optimal graphs can be seen as the training process for the KAOG classifier which uses the components of the graph and their purities to classify new data instances, as will be exposed in Sect. 3.3.
3.1 The Kassociated graph and the Kassociated optimal graph
A Kassociated graph is constructed from a vectorbased data set X={x_{1},…,x_{ N }} by representing each data instance x_{ i }=(x_{i1},x_{i2},…,x_{ ip },c_{ i }) as a vertex v_{ i } with its associated class label c_{ i }, where c_{ i }∈Ω={ω_{1},ω_{2},…,ω_{ M }} and M is the number of classes in the problem. The graph construction resembles to a KNN graph, due to the use of a predefined number of neighbors, K, that each vertex must connect. Although the Kassociated graph does differ from the KNN approach by the fact that amongst the possible K neighbors of a vertex v_{ i }, it can only be connected to neighbors of the same class as v_{ i }. Hence, we consider the labelindependent and the labeldependent Kneighborhood of vertex v_{ i }. The former is simply the set of vertices that represents the K nearest neighbors of the instance x_{ i } according to a given measure and will be noted by \(\varLambda_{v_{i},K}\). The latter comprises only the vertices with the same class as v_{ i } among its K nearest neighbors, and is defined as \(\varDelta _{v_{i},K} = \{v_{j}\mid v_{j} \in\varLambda_{v_{i},K}\ \mathrm{AND}\ c_{i} = c_{j}\}\).
Formally, the Kassociated graph is defined as a directed graph G=(V,E) which consists of a set of labeled vertices V and a set of edges E between them, where an edge e_{ ij }=(v_{ i },v_{ j }) connects vertex v_{ i } with vertex v_{ j } if and only if \(v_{j} \in \varDelta _{v_{i},K}\). As a consequence, only vertices of the same class can be connected. The resulting Kassociated graph can be viewed as a set of disjoint subgraphs or components C={C_{1},…,C_{ α },…,C_{ R }}. Each component C_{ α } is composed of vertices of a single class, so each component represents a single class; we refer to the label of component C_{ α } as \(\hat{C}_{\alpha}\). The number of components R varies according to the magnitude of K, but always lies in the range N≥R≥M, with N being the number of vertices in the training set and M the number of classes. Higher values of K induce fewer and larger components in the constructed graph, while lower values lead to small sized ones. This wiring mechanism leads to a graph with some important features: (i) By varying K, different graphs can be generated, and as the value of K increases, the number of components decreases monotonically to the number of classes. (ii) The total number of edges among the vertices of a component C_{ α } is proportional to K and can be at most equal to KN_{ α }, where N_{ α } is the number of vertices in component C_{ α }. (iii) This maximum value is only achieved if all vertices in the neighborhood of any vertex of the component have the same class. Likewise, nearby vertices of other classes decrease the number of connections of the given component. Thus, one can define a measure of “purity” for components, as explained ahead.
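The definition above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: it assumes Euclidean distance as the similarity measure, a brute-force distance matrix, and a union-find structure to recover the components; the function name `k_associated_graph` is hypothetical.

```python
import numpy as np

def k_associated_graph(X, y, K):
    """Return the directed edges (i, j), where j is among the K nearest
    neighbors of i and has the same class, plus a component id per vertex."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    # Brute-force Euclidean distance matrix (an assumption of this sketch).
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)           # a vertex is not its own neighbor
    parent = list(range(n))                  # union-find over vertices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]    # path halving
            i = parent[i]
        return i

    edges = []
    for i in range(n):
        # Label-independent neighborhood: the K nearest neighbors of v_i.
        nearest = np.argsort(dist[i], kind="stable")[:K]
        for j in nearest:
            j = int(j)
            if y[j] == y[i]:                 # keep only the label-dependent set
                edges.append((i, j))
                parent[find(i)] = find(j)    # i and j share a component
    components = [find(i) for i in range(n)]
    return edges, components

# Two well separated classes on a line: every vertex makes its K connections.
X_demo = [[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]]
y_demo = [0, 0, 0, 1, 1, 1]
edges_demo, comps_demo = k_associated_graph(X_demo, y_demo, K=2)
```

In the separated example each of the six vertices makes both of its connections, so the edge count reaches the bound KN_{ α } in each component, and the number of components equals the number of classes.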
In this way, Φ_{ α }=1 if and only if, for every v_{ i } in the component C_{ α }, all the K neighbors have the same class label as v_{ i }. On the other hand, if there is noise, or if two or more classes are mixed together, vertices in this region are unable to make their K connections due to the existence of vertices of other classes in their neighborhoods. In the latter case, the more mixed the components are, the lower their average degrees D_{ α } and, consequently, their respective purities Φ_{ α }.
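The defining equation of the purity measure is not reproduced in this text. A form that is consistent with the statements above (at most KN_{ α } directed edges per component, and Φ_{ α }=1 exactly when every vertex makes its K connections) is given below as an assumption, not as a quotation of the original equation. Here k_{i}^{in} and k_{i}^{out} are notation introduced only for this sketch, denoting the in-degree and out-degree of v_{ i }.

```latex
D_{\alpha} = \frac{1}{N_{\alpha}} \sum_{v_{i} \in C_{\alpha}} \left( k_{i}^{\mathrm{in}} + k_{i}^{\mathrm{out}} \right),
\qquad
\varPhi_{\alpha} = \frac{D_{\alpha}}{2K}
```

Under this form, a component with the maximum of KN_{ α } edges has each edge counted once as outgoing and once as incoming, so D_{ α }=2K and Φ_{ α }=1; every missing connection lowers D_{ α } and therefore Φ_{ α }.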
The optimal graph improves the representation of the training set and provides the best configuration of components according to their purities. It corresponds to the best graph organization regarding the purity measure.
3.2 The semisupervised Kassociated optimal graph
Consider now obtaining the optimal graph from a partially labeled set X. It is easy to see that it is not possible to obtain the aforementioned graph through the previous description due to the presence of unlabeled patterns. Therefore, we propose here the semisupervised construction of the Kassociated optimal graph.
The problem addressed here regards the absence of enough labeled data in a given data set to employ a regular supervised method. Therefore, it is necessary to consider a semisupervised method in order to induce a classifier from both labeled and unlabeled patterns. Hence, consider the data set X={(x_{1},c_{1}),…,(x_{ l },c_{ l }),x_{l+1},…,x_{ N }} with l labeled patterns (x_{ i },c_{ i }) and N−l unlabeled patterns x_{ j } (or (x_{ j },∅)). Like its supervised counterpart, the semisupervised Kassociated optimal graph construction involves creating a sequence of semisupervised Kassociated graphs. The main difference between the supervised and semisupervised Kassociated graphs lies in the set of neighbors to which each vertex connects. Instead of considering only the labeldependent set (\(\varDelta _{v_{i},K}\)), here each vertex v_{ i } connects to all vertices in the set \(\varGamma_{v_{i},K} = \{v_{j} \mid v_{j} \in\varLambda_{v_{i},K}\ \mathrm{AND}\ (c_{j} = c_{i}\ \mathrm{OR}\ c_{i} =\emptyset\ \mathrm{OR}\ c_{j} = \emptyset) \}\). This set encompasses the K nearest neighbors of v_{ i } whose classes are not different from the class of v_{ i }. This means that, among its K nearest neighbors, v_{ i } connects to those vertices which belong to the same class as v_{ i } or which have no label. If v_{ i } itself does not have a class label, it connects to all of its K nearest neighbors regardless of their classes.
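The set \(\varGamma_{v_{i},K}\) can be illustrated with a short Python sketch. It assumes Euclidean distance and represents a missing label (∅) by `None`; the function name `gamma_neighbors` is hypothetical and chosen only for this example.

```python
import numpy as np

def gamma_neighbors(X, labels, i, K):
    """Indices among the K nearest neighbors of vertex i that are compatible
    with it: same class, or at least one of the two vertices is unlabeled."""
    X = np.asarray(X, dtype=float)
    dist = np.linalg.norm(X - X[i], axis=1)   # Euclidean distance (assumed)
    dist[i] = np.inf                          # exclude the vertex itself
    nearest = np.argsort(dist, kind="stable")[:K]
    return [int(j) for j in nearest
            if labels[i] is None              # c_i is empty
            or labels[j] is None              # c_j is empty
            or labels[j] == labels[i]]        # c_j equals c_i

X_demo = [[0.0], [1.0], [2.0], [3.0]]
labels_demo = ["a", None, "b", "a"]
from_labeled = gamma_neighbors(X_demo, labels_demo, 0, K=2)
from_unlabeled = gamma_neighbors(X_demo, labels_demo, 1, K=2)
```

The labeled vertex 0 keeps its unlabeled neighbor and drops the neighbor of class "b", while the unlabeled vertex 1 keeps both of its nearest neighbors, which is how labels of different classes can initially end up in the same component.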
The cutting process in the component C_{ α } continues until it is separated into single class components. The rationale behind the criterion is that, by cutting the edges that connect low purity vertices whose respective patterns are distant from each other, it is more likely to obtain separated, wellconnected components. In fact, low purity vertices are usually found in boundary regions between components of different classes in supervised tasks. However, in the semisupervised scenario, purity itself can be a misleading measure due to the high connection probability of the unlabeled vertices. Therefore, the distance weight in Eq. (3) favors cutting the edges with the highest distance in the component.
In summary, the main modifications in the original Kassociated optimal graph construction algorithm [4] include connecting each vertex to all its neighbors with the same class or without a class label (line 6) and merging every component with empty class (in the Algorithm 1, \(\hat{C}_{\alpha}\) stands for the class of a component) to another component, independent of purity. Notice that the present algorithm not only can construct the Kassociated optimal graph, but also, by doing so, can spread the labels throughout the whole training set. Therefore, the KAOGSS algorithm is a transductive method.
3.3 The KAOG classifier
 1. Calculate the distances between the new pattern y and all elements x_{ i } in the training set
 2. Find the K_{ L } nearest neighbors of y, noted in ascending order as \(\bar{\varLambda}_{v_{y},K_{L}} = \{\mathbf{x}_{(1)},\mathbf {x}_{(2)},\ldots,\mathbf{x}_{(k)},\ldots, \mathbf{x}_{(K_{L})}\}\)
 3. For k=1 to K_{ L }
    Locate the vertex (and component) that represents x_{(k)}, say v_{ j }∈C_{ α }
    If k≤K_{ α } then
       Connect v_{ y } to v_{ j }

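The connection steps listed above can be sketched as follows. This Python fragment covers only those steps, not the subsequent class decision, and it rests on assumptions: Euclidean distance, K_{ L } passed in as a plain argument, and a hypothetical data layout in which `component_of[j]` gives the component of training vertex j and `K_of_component[c]` gives the value K_{ α } associated with component c.

```python
import numpy as np

def connect_new_pattern(y, X_train, component_of, K_of_component, K_L):
    """Return the training indices to which the new vertex v_y is connected."""
    X_train = np.asarray(X_train, dtype=float)
    # Step 1: distances between y and every training element.
    dist = np.linalg.norm(X_train - np.asarray(y, dtype=float), axis=1)
    # Step 2: the K_L nearest neighbors, in ascending order of distance.
    order = np.argsort(dist, kind="stable")[:K_L]
    connections = []
    # Step 3: connect to the k-th neighbor only if k <= K of its component.
    for k, j in enumerate(order, start=1):
        alpha = component_of[int(j)]
        if k <= K_of_component[alpha]:
            connections.append(int(j))
    return connections

X_demo = [[0.0], [1.0], [5.0], [6.0]]
component_demo = [0, 0, 1, 1]
K_demo = {0: 1, 1: 3}
links = connect_new_pattern([0.4], X_demo, component_demo, K_demo, K_L=3)
```

In this example the second nearest neighbor belongs to a component with K_{ α }=1 and is therefore skipped, while the third nearest neighbor belongs to a component with K_{ α }=3 and is connected.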
3.4 Classifying partially labeled data stream
Algorithm 2 presents the KAOGINCSSL algorithm, which processes a data stream S composed of partially labeled and unlabeled data sets. The function nextChunk(S) removes the next set from stream S and puts it into the variable Z, used to represent a chunk of data. After assigning the next set to Z, the algorithm determines whether the set is sufficiently labeled to be considered for training/updating (i.e. whether the set has enough labeled patterns, e.g., at least 5 %) through the function isPartiallyLabeled(Z), which returns “true” if Z is partially labeled and “false” otherwise.
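The control flow just described can be sketched in Python. Algorithm 2 itself is not reproduced here, so this is only an outline under assumptions: a chunk is a list of (pattern, label) pairs with `None` for a missing label, iterating over the stream stands in for nextChunk(S), and `train` and `classify` are placeholder callables for the graph update and the KAOG classifier.

```python
def is_partially_labeled(chunk, min_fraction=0.05):
    """True if at least `min_fraction` of the patterns in the chunk are labeled."""
    if not chunk:
        return False
    labeled = sum(1 for _, label in chunk if label is not None)
    return labeled / len(chunk) >= min_fraction

def process_stream(stream, train, classify):
    """Train on partially labeled chunks and classify the unlabeled ones."""
    predictions = []
    for chunk in stream:                      # plays the role of nextChunk(S)
        if is_partially_labeled(chunk):
            train(chunk)                      # update the principal graph
        else:
            predictions.extend(classify(x) for x, _ in chunk)
    return predictions                        # reached when the stream is empty

# Toy run with placeholder callables (purely illustrative).
seen = []
stream_demo = [
    [((0.0,), "a"), ((1.0,), None)],          # partially labeled: training
    [((2.0,), None), ((3.0,), None)],         # fully unlabeled: classification
]
out = process_stream(stream_demo, train=seen.append, classify=lambda x: "a")
```

The 5 % default mirrors the example threshold mentioned in the text; any other threshold can be passed through `min_fraction`.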
Therefore, the tasks of the algorithm are twofold: (i) incorporate new knowledge from both labeled and unlabeled patterns to cope with concept drift and (ii) predict the labels of the unlabeled patterns presented in unlabeled sets. In the former task, the objective is to incorporate new knowledge from the recently obtained partially labeled set, so a semisupervised Kassociated optimal graph is derived using Algorithm 1 (KAOGSS). As explained in Sect. 3.2, the KAOGSS algorithm generates the Kassociated optimal graph, spreading the labels to all vertices, and the resulting graph is composed of several disjoint components. These new components are then merged into the principal graph (G_{ P }), which is composed of independent components. However, the addition of new components increases the size of the principal graph, which may increase classification error and time. To avoid this problem, the principal graph should not grow without limit; thus, old and unused components should be removed.
The task of classifying new patterns takes place if the set at hand is unlabeled, and it is resolved by simply applying the KAOG classifier, using the principal graph to classify unlabeled vertices, as presented in Sect. 3.3. Component removal takes place during the classification phase by applying a method named the disuse rule. This rule establishes a maximum number of consecutive classifications in which a component is allowed to be unused (i.e. it does not receive any connections during classification). The maximum value accepted is set by the parameter τ. When a component remains out of use after τ patterns are classified, it is removed from the principal graph. The algorithm finishes when the whole stream has been processed, i.e. S=∅.
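A minimal sketch of the disuse rule is given below in Python. The class and method names are hypothetical, and the sketch tracks only component identifiers and their counters, not the graph itself; it assumes that a component is removed as soon as its counter of consecutive unused classifications reaches τ.

```python
class DisuseTracker:
    """Counts, per component, the consecutive classifications without use."""

    def __init__(self, tau):
        self.tau = tau
        self.unused = {}                     # component id -> unused counter

    def add_component(self, cid):
        self.unused[cid] = 0

    def record_classification(self, used_components):
        """Update counters after one pattern is classified and drop the
        components that have been unused for tau consecutive patterns."""
        for cid in list(self.unused):
            if cid in used_components:
                self.unused[cid] = 0         # the component received a connection
            else:
                self.unused[cid] += 1
                if self.unused[cid] >= self.tau:
                    del self.unused[cid]     # removed from the principal graph

tracker = DisuseTracker(tau=3)
tracker.add_component("A")
tracker.add_component("B")
tracker.record_classification({"A"})
tracker.record_classification({"A"})
alive_after_two = set(tracker.unused)
tracker.record_classification({"A"})
alive_after_three = set(tracker.unused)
```

Because one pass over the counters is made per classified pattern, the cost of this bookkeeping grows with the number of components rather than with the number of vertices.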
An important feature of stream classification algorithms is their ability to process data in a reasonable time, which includes the tasks of training, updating and classifying. The proposed algorithm consists of the following phases of data processing: (i) training or updating the principal graph, (ii) classifying new data, and (iii) removing unused components.
In the first phase, training or updating the principal graph is required whenever a partially labeled set X is presented. Let there be N instances in the set X; training (or updating) corresponds to building a semisupervised Kassociated optimal graph (Algorithm 1). As estimated in Ref. [4], the complexity order to build a supervised Kassociated optimal graph is about O(N^{2}), due to the distance matrix calculation. Also, it has been shown that the Kassociated optimal graph construction scales better than the C4.5 and the Gibbs Sampling algorithms. The only additional processing in the semisupervised version is the need to verify whether a component presents more than one class, in which case the algorithm cuts some edges to divide it into single class components. The process of finding and cutting a component with the proposed technique depends on the number of vertices and edges in the component, i.e. it is O(N_{ α }+E_{ α }), where N_{ α } and E_{ α } are the number of vertices and edges in the component C_{ α }, respectively. Since Kassociated graphs are sparse, N_{ α } and E_{ α } are much smaller than the number of vertices in the whole graph. Combined with the fact that few components need to be partitioned (only those composed of vertices from more than one class), it can be verified that the computational order of this phase remains O(N^{2}).
Now we consider the second phase. The order of classifying a new pattern has also been estimated in the aforementioned work as O(N_{ p }), due to the distance calculation between the new vertex and the N_{ p } vertices in the principal graph. Here, it is important to mention that there exist strategies for lowering the order, for example locating the nearest components first, instead of actually searching for the neighboring vertices. Such a strategy decreases the computational cost to O(N_{ cp }), with N_{ cp } being the number of components in the principal graph, and N_{ cp }≪N_{ p }. Finally, component removal is performed by the disuse rule, which simply checks the time parameter of each component and therefore has order O(N_{ cp }).
4 Experimental results
The experimental results are obtained considering five nonstationary data sets, with three of them generated artificially, SEA [29], Sine and Circles [13] and the other two are real data, Spam and Elec2 [16]. For all the experiments, Algorithm 1 is used to spread the label to all the training sets.
In order to simulate a stream of partially labeled data and assess how the algorithms react to different amounts of labeled patterns, we have generated nine experiments for each domain, differing from each other in the percentage of labeled patterns in the training sets, with the percentages lying in the set {90 %, 80 %, 70 %, 60 %, 50 %, 40 %, 30 %, 20 %, 10 %}. Each stream is presented as a sequence of chunks of data, alternating between a partially labeled set and a fully unlabeled set. The partially labeled sets are used for training (or updating) the classifiers, while the fully unlabeled sets are used as test sets to estimate the classification accuracy of the algorithms. Here, we use the real labels of the test sets to estimate the classifier accuracy. Among the artificially generated stream data, the SEA domain is presented along 500 realizations of training sets with 60 patterns and test sets with 40 patterns. The other two streams, Circles and Sine, are presented along 200 realizations of alternating training and test sets, each of them with 25 patterns. Regarding the two real data sets, we consider a realistic situation where there are not enough data to set aside for testing. Therefore, the same set is first used for testing and then for training. The Elec2 domain represents the electricity price fluctuation gathered during a given period (for details, please refer to Ref. [13]). The domain is composed of 45,312 patterns, which can be divided into 134 chunks of 336 patterns (except for the first set with 288), each representing a week of price variation. The Spam base is composed of 4601 patterns representing spam and real mail; the chunks, in this case, are defined with 45 patterns, except for the initial one with 101, and the stream is presented along 100 realizations.
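The chunk sizes quoted for the two real domains can be checked with a few lines of arithmetic. The Python sketch below uses a hypothetical helper, `chunk_sizes`, and reads the description as one initial chunk followed by equally sized chunks; that reading is an assumption of the sketch.

```python
def chunk_sizes(total, first, regular):
    """Sizes of the chunks when an initial chunk of `first` patterns is
    followed by chunks of `regular` patterns that exactly use up `total`."""
    remaining = total - first
    if remaining < 0 or remaining % regular != 0:
        raise ValueError("the sizes do not tile the data set exactly")
    return [first] + [regular] * (remaining // regular)

elec2 = chunk_sizes(45312, first=288, regular=336)
spam = chunk_sizes(4601, first=101, regular=45)
```

Under this reading, the Elec2 stream consists of the initial set of 288 patterns followed by 134 sets of 336 patterns (288 + 134 × 336 = 45,312), and the Spam stream consists of the initial set of 101 patterns followed by 100 sets of 45 patterns (101 + 100 × 45 = 4601).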
Regarding the algorithms under comparison, three of them are ensemble algorithms chosen due to their high adaptability. The SEA ensemble [29] consists of a pool of C4.5 classifiers [26] and works by evaluating each of the decision trees, whose outputs are used to decide the ensemble output by a simple majority voting scheme. Every time a training batch arrives, a new decision tree is trained and it replaces the tree in the ensemble with the largest number of mistakes up to that point. Another algorithm implemented for comparison is the DWM [19], an ensemble method that can be composed of virtually any classifiers. Briefly, the DWM algorithm adds a new incremental classifier to the ensemble every time an error is committed by the ensemble. Each single classifier has a weight that is decreased by a given factor β every time it commits an error. To control the size of the ensemble, every p iterations, those classifiers whose weight is less than a predefined threshold θ are removed. As recommended by the authors, the incremental Naive Bayes (see Ref. [19] for details and references) has been used as base classifier; therefore we write DWMNB hereafter. The third algorithm, proposed by Wang et al. [33] and referred to here as WCEA, is also an ensemble that uses a decision tree as base algorithm, similar to SEA, but with weighted classifiers. The weight of each base classifier is estimated by its classification accuracy in a test set. Therefore, the weight of base classifier h_{ k } is given by w_{ k }=MSE_{ r }−MSE_{ i }, where MSE_{ i } corresponds to the generalization error and can be obtained through a crossvalidation process, while MSE_{ r } is the estimated error given the new data set, and can be calculated as \(\mathrm{MSE}_{r} =\sum_{\omega_{j} \in\varOmega} p(\omega_{j}) (1 - p(\omega_{j}))\), with p(ω_{ j }) the percentage of instances belonging to class ω_{ j }.
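The weighting formula at the end of the paragraph can be worked through numerically. The Python sketch below is only an illustration of that formula; the function names are hypothetical, the class proportions are passed as fractions in [0, 1], and the value used for MSE_{ i } is an invented number for the example rather than a result from the experiments.

```python
def mse_r(class_fractions):
    """MSE_r = sum over classes of p(w_j) * (1 - p(w_j))."""
    return sum(p * (1.0 - p) for p in class_fractions)

def classifier_weight(mse_i, class_fractions):
    """w_k = MSE_r - MSE_i for a base classifier with generalization error MSE_i."""
    return mse_r(class_fractions) - mse_i

balanced = mse_r([0.5, 0.5])              # two equally frequent classes
skewed = mse_r([0.9, 0.1])                # one dominant class
weight = classifier_weight(0.1, [0.5, 0.5])
```

For two balanced classes MSE_{ r } is 0.5, so a base classifier with an error of 0.1 receives a weight of 0.4; as the class distribution becomes more skewed, MSE_{ r } shrinks, and the same base classifier receives a smaller weight.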
Considering the experimental results displayed in Fig. 3(a), as expected, the performance of all the algorithms tends to degrade as the labeling percentage provided in the training sets decreases. Notice that the proposed algorithm KAOGINCSSL and the DWMNB algorithm have performed similarly throughout all the different label percentages, with the exception of the experiment with 10 % of labeled patterns, where the KAOGINCSSL algorithm presented a better performance. In fact, even when only 20 % of the training patterns are labeled, KAOGINCSSL and DWMNB present similar performance, differing in that the proposed algorithm is much more stable, presenting the smallest variance. Regarding the WCEA algorithm, from Fig. 3(a) we see that it is the algorithm that suffers the most as the amount of labeled patterns decreases. Again, when considering the 20 % labeled set, in spite of presenting an average error percentage very close to that of the SEA algorithm, the WCEA algorithm presents a larger variation in error rate along the stream processing, as can be seen in Figs. 3(b)–(c). The SEA ensemble has the worst performance in this domain.
In real applications, low variance or standard deviation is a desirable feature for a classifier: the lower the standard deviation, the more reliable the classifier performance. Considering the results for the electricity domain presented in Fig. 4(c), except for the WCEA algorithm, all the others have presented similar performance, especially for low levels of labeling (<40 %). Here, again, the proposed algorithm obtained the best performance for the experiments with more than 50 % of labeled data. Analyzing the standard deviation in Fig. 5(a), it is easy to verify that the proposed algorithm presents the most reliable performance. The DWMNB algorithm presents very high values of standard deviation, indicating high fluctuation in classification performance, in spite of presenting good average accuracy. The SEA ensemble has good accuracy results and low variance.
Regarding the results of the KAOGINCSSL algorithm in the Spam base domain shown in Fig. 4(d), at first glance almost the same trend as in the Elec2 domain can be observed: it presented the best average accuracy for experiments with more than 50 % labeled patterns and average performance for the rest. In spite of that, the KAOGINCSSL algorithm again shows the most regular performance, as depicted in Fig. 5(b). The DWMNB algorithm also performs well, particularly until the proportion of labeled data instances falls below 40 %, but with higher standard deviation than KAOGINCSSL. Thus, we can say that the KAOGINCSSL and DWMNB algorithms perform similarly. The SEA ensemble presents the lowest average accuracy but a small standard deviation, while the WCEA, in spite of presenting near-average mean accuracy, shows a very high standard deviation, which discourages the use of both in this domain.
It is also important to notice that all the algorithms that used KAOGSS as the transduction algorithm present good results, especially in the real domains. Therefore, we verify that the proposed transduction algorithm KAOGSS can be used not only in association with the KAOGINCSSL algorithm, but also successfully with other algorithms.
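The two-stage use described above, in which a transduction step first labels the unlabeled patterns of a batch and any supervised learner is then trained on the fully labeled batch, can be sketched as follows. This is only an illustration under stated assumptions: the nearest-labeled-neighbor rule is a simple stand-in and is not the KAOGSS algorithm, the centroid classifier is a toy learner, and all names and data are hypothetical.

```python
import math


def spread_labels_nearest(points, labels):
    """Stand-in transduction (NOT KAOGSS): each unlabeled point (label None)
    receives the label of its nearest labeled point."""
    labeled = [(p, y) for p, y in zip(points, labels) if y is not None]
    if not labeled:
        raise ValueError("at least one labeled point is required")
    filled = []
    for p, y in zip(points, labels):
        if y is None:
            _, y = min(labeled, key=lambda item: math.dist(p, item[0]))
        filled.append(y)
    return filled


def train_centroids(points, labels):
    """Toy supervised learner: one centroid per class."""
    groups = {}
    for p, y in zip(points, labels):
        groups.setdefault(y, []).append(p)
    return {y: tuple(sum(c) / len(ps) for c in zip(*ps))
            for y, ps in groups.items()}


def predict(centroids, point):
    """Assign the class whose centroid is closest to the point."""
    return min(centroids, key=lambda y: math.dist(point, centroids[y]))


# Hypothetical batch in which only one pattern per class is labeled.
batch = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
partial = ["a", None, None, "b", None, None]

full = spread_labels_nearest(batch, partial)  # stage 1: transduction
model = train_centroids(batch, full)          # stage 2: supervised training
print(full, predict(model, (0.1, 0.1)), predict(model, (5.0, 5.0)))
```

Because the supervised learner only ever sees a fully labeled batch, the transduction step can be swapped without changing the learner, which is the property exploited when KAOGSS is paired with the ensemble methods.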
5 Conclusions
This paper has introduced a semisupervised graph-based algorithm suitable for nonstationary streaming applications, particularly when only a small portion of the acquired data is labeled. Comparisons on artificial and real data sets between the proposed method and three well-known ensemble methods show that the proposed algorithm outperformed the compared algorithms in most of the experiments. Moreover, the results show that the proposed label spreading technique can be used successfully with other supervised learning algorithms to support semisupervised classification. Future work includes testing the proposed algorithm on more data sets and comparing it to other algorithms that have their own label spreading method, as well as comparing the accuracy of the optimal graph as a transductive method against other transductive methods.
Declarations
Acknowledgements
This work is supported by the Brazilian National Research Council (CNPq) and by the São Paulo State Research Foundation (FAPESP).
References
Belkin M, Niyogi P (2003) Laplacian eigenmaps for dimensionality reduction and data representation. Neural Comput 15:1373–1396
Belkin M, Niyogi P, Sindhwani V (2006) Manifold regularization: a geometric framework for learning from labeled and unlabeled examples. J Mach Learn Res 1:1–48
Bertini JR Jr, Lopes A, Motta R, Zhao L (2010) Online classifier based on the optimal K-associated network. In: Proceedings of the joint conference, III international workshop on web and text intelligence (WTI’10), pp 826–835
Bertini JR Jr, Zhao L, Motta R, Lopes A (2011) A nonparametric classification method based on K-associated graphs. Inf Sci 181:5435–5456
Bornholdt S, Schuster H (eds) (2003) Handbook of graphs and networks: from the genome to the Internet, 1st edn. Wiley-VCH, Weinheim
Breve FA, Zhao L, Quiles M, Pedrycz W, Liu J (2011) Particle competition and cooperation in networks for semisupervised learning. IEEE Trans Knowl Data Eng. doi:10.1109/TKDE.2011.119
Chapelle O, Zien A, Schölkopf B (eds) (2006) Semisupervised learning, 1st edn. MIT Press, Cambridge
Chapelle O, Sindhwani V, Keerthi S (2008) Optimization techniques for semisupervised support vector machines. J Mach Learn Res 9:203–233
Cormen T, Leiserson C, Rivest R, Stein C (2009) Introduction to algorithms, 3rd edn. MIT Press, Cambridge
Culp M, Michailidis G (2008) Graph-based semisupervised learning. IEEE Trans Pattern Anal Mach Intell 30(1):174–179
Ditzler G, Polikar R (2011) Semisupervised learning in nonstationary environments. In: Proceedings of international joint conference on neural networks (IJCNN’11), San Jose, CA, USA. IEEE Press, New York, pp 2741–2748
Erman J, Mahanti A, Arlitt M, Cohen I, Williamson C (2007) Offline/realtime traffic classification using semisupervised learning. Perform Eval 64:1194–1213
Gama J, Medas P, Castillo G, Rodrigues P (2004) Learning with drift detection. In: Proceedings of the Brazilian symposium on artificial intelligence (SBIA’04), vol 3171. Springer, Berlin, pp 286–295
Giraud-Carrier C (2000) A note on the utility of incremental learning. AI Commun 13(4):215–223
Hastie T, Tibshirani R, Friedman J (2009) The elements of statistical learning: data mining, inference and prediction, 2nd edn. Springer, Berlin
Hettich S, Bay S (1999) The UCI KDD archive. University of California, Irvine, School of Information and Computer Sciences. http://kdd.ics.uci.edu/
Kelly M, Hand D, Adams N (1999) The impact of changing populations on classifier performance. In: Proceedings of the international conference on knowledge discovery and data mining (KDD’99). ACM, New York, pp 367–371
Klinkenberg R, Joachims T (2000) Detecting concept drift with support vector machines. In: Proceedings of the international conference on machine learning (ICML’00). Morgan Kaufmann, San Mateo, pp 487–494
Kolter JZ, Maloof MA (2007) Dynamic weighted majority: an ensemble method for drifting concepts. J Mach Learn Res 8:2755–2790
Li P, Wu X, Hu X (2010) Mining recurring concept drift with limited labeled streaming data. In: JMLR: workshop and conference proceedings, vol 13, pp 241–252
Lopes AA, Bertini JR Jr, Motta R, Zhao L (2009) Classification based on the optimal K-associated network. In: Proceedings of the international conference on complex sciences: theory and applications (COMPLEX’09). Lecture notes of the Institute for Computer Sciences, Social-Informatics and Telecommunications Engineering (LNICST), vol 4. Springer, Berlin, pp 1167–1177
Masud M, Gao J, Khan L, Han J (2008) A practical approach to classify evolving data streams: training with limited amount of labeled data. In: Proceedings of the international conference on data mining (ICDM’08)
Minku L, White A, Yao X (2010) The impact of diversity on online ensemble learning in the presence of concept drift. IEEE Trans Knowl Data Eng 22:730–742
Narasimhamurthy A, Kuncheva L (2007) A framework for generating data to simulate changing environments. In: Proceedings of the international artificial intelligence and applications (ICAIA’07), pp 384–389
Quiles M, Zhao L, Alonso RL, Romero RAF (2008) Particle competition for complex network community detection. Chaos 18:033107
Quinlan JR (1993) C4.5 programs for machine learning, 1st edn. Morgan Kaufmann, San Mateo
Schaeffer S (2007) Graph clustering. Comput Sci Rev 1:27–34
Schlimmer J, Granger R (1986) Beyond incremental processing: tracking concept drift. In: Proceedings of the association for the advancement of artificial intelligence (AAAI’86). AAAI Press, Menlo Park, pp 502–507
Street N, Kim Y (2001) A streaming ensemble algorithm (SEA) for large-scale classification. In: Proc int’l conf knowledge discovery and data mining (KDD’01). ACM, New York, pp 377–382
Sung J, Kim D (2009) Adaptive active appearance model with incremental learning. Pattern Recognit Lett 30:359–367
Syed N, Liu H, Sung K (1999) Handling concept drift in incremental learning with support vector machines. In: Proceedings of the international conference on knowledge discovery and data mining (KDD’99), pp 272–276
von Luxburg U (2007) A tutorial on spectral clustering. Stat Comput 17:395–416
Wang H, Fan W, Yu P, Han J (2003) Mining concept-drifting data streams using ensemble classifiers. In: Proc international conference on knowledge discovery and data mining (KDD’03), pp 226–235
Widmer G, Kubat M (1996) Learning in the presence of concept drift and hidden contexts. Mach Learn 23(1):69–101
Yang C, Zhou J (2008) Nonstationary data sequence classification using online class priors estimation. Pattern Recognit 41:2656–2664
Yu Y, Guo S, Lan S, Ban T (2008) Anomaly intrusion detection for evolving data stream based on semisupervised learning. In: Proceedings of the international conference on advances in neuroinformation processing (NIPS’08), pp 571–578
Zhang P, Zhu X, Guo L (2009) Mining data streams with labeled and unlabeled training examples. In: Proceedings of the ninth IEEE international conference on data mining (ICDM’09). IEEE Press, New York, pp 627–636
Zhu X (2008) Semisupervised learning literature survey. Tech Rep 1530, Computer Science, University of Wisconsin-Madison
Zhu X (2005) Semisupervised learning with graphs. Tech Rep Doctoral Thesis, School of Computer Science, Carnegie Mellon University