Partially labeled data stream classification with the semisupervised K-associated graph
Journal of the Brazilian Computer Society volume 18, pages 299–310 (2012)
Abstract
Regular data classification techniques rest mainly on two strong assumptions: (1) a reasonably large labeled data set is available for training; and (2) future input instances conform to the distribution of the training set, i.e. the data distribution is stationary over time. In data stream classification, however, both assumptions are difficult to satisfy. In this paper, we present a graph-based semisupervised approach that extends the static classifier based on the K-associated optimal graph to perform online semisupervised classification. In order to learn from both labeled and unlabeled patterns, we adapt the optimal graph construction so that it simultaneously spreads the labels through the training set. The sparse, disconnected nature of the proposed graph structure gives it the flexibility to cope with nonstationary classification. An experimental comparison between the proposed method and three state-of-the-art ensemble classification methods is provided, and promising results have been obtained.
Introduction
Recently, graph-based (also referred to as network-based) algorithms applied to data mining tasks have attracted great attention in both theoretical research and practical applications [5]. This growing interest is mostly justified by the advantages of graph representation, such as revealing the topological structure of the input data and identifying data clusters of arbitrary shape [27]. In such graph-based algorithms, each vertex of the graph represents a data pattern (data instance) and the edges stand for some similarity relation between vertices. In order to reveal significant relations within a data set, the following rule is usually adopted for establishing connections between data patterns: the higher the similarity among data, the higher the probability of connection [39]. As a result, nearby patterns tend to be heavily linked together while distant patterns may form a sparse structure. This property has been extensively explored in graph-based solutions, especially for unsupervised tasks such as clustering [32] and dimensionality reduction [1]. Only recently has graph-based classification been addressed, usually in the form of semisupervised learning [38].
Semisupervised learning methods concern the problem of automatic classification on data sets with a small number of labeled data and a large amount of unlabeled data [7]. Such an approach relies on the fact that labeled data are difficult to gather and are often associated with high costs, while unlabeled data are abundant in most applications and generally easy to collect. Moreover, the manual labeling process is not always reliable or practicable. For example, consider obtaining enough labeled data to train a classifier for a spam detection task (i.e. separating spam from valid email). Such an application design (1) incurs the cost of paying an expert or a group of users to separate what they call spam from what they consider real email; (2) may result in inconsistencies if all human categorizations are accepted, since the same email message may be considered spam by some people and valid email by others; and (3) requires considerable time to manually label enough data to train a regular supervised learning method.
A spam detection application is really a stream classification problem, in the sense that the classifier needs to classify new patterns as they arrive [35]. In this kind of application, the underlying data distribution changes over time, and such changes often make the model built on old data inconsistent with the newly arrived data. This problem, known as concept drift [34], requires frequent updating of the model. In summary, we have a classification problem that consists of a data stream in which few instances are labeled and the data distribution may change over time. This scenario poses a challenging task for machine learning, because there are too few labeled data along the stream to apply a supervised incremental algorithm and the presence of concept drift precludes the use of static classifiers. In fact, only recently have such applications been properly addressed, thanks to the concept of learning from both labeled and unlabeled data and the development of semisupervised learning strategies.
In the development of semisupervised learning algorithms, much effort has gone into using a clustering algorithm to group the patterns and then spread the labels. For this approach, the K-means algorithm is a natural choice. Li et al. [20] proposed a tree-based algorithm that uses K-means to spread labels at the leaves of a tree. Masud et al. [22] proposed an ensemble of micro-clusters, obtained with the K-means algorithm, in which instances are then classified according to the K-nearest neighbor rule. Ditzler and Polikar [11] proposed an ensemble of classifiers, named WEA, which are trained with labeled patterns only. Unlabeled data and the K-means algorithm are then used to generate a mixture of Gaussian models for further adjusting the weight of each classifier. Zhang et al. [37] use the semisupervised SVM [8] together with a version of K-means, referred to as relational K-means, to construct new features for the labeled examples from information extracted from unlabeled instances. Some investigations have tackled specific problems, e.g. Erman et al. [12] proposed a method to perform traffic classification in computer networks with partially labeled data. Their method uses a clustering algorithm, such as K-means, to obtain the clusters, and the labels are then spread using maximum likelihood estimation. The clusters that remain unlabeled are likely to be an undefined group. Also regarding computer networks, Yu et al. [36] considered the problem of intrusion detection. They employ a strategy similar to K-means by grouping the labeled data, after which the labels are spread to the whole data set according to the distances from the clusters to the unlabeled patterns. Finally, an SVM is trained to detect intrusion.
To the best of our knowledge, graph-based approaches have not been considered for streaming classification problems in which data are partially labeled, although they have been successfully applied to semisupervised learning, especially to the transduction problem [2, 6, 10, 25]. In view of the recently developed graph-based nonparametric classification method and its good performance on stationary data sets [4, 21], we proposed a nonstationary version, with initial results reported in Ref. [3]. In this paper, we propose an extended version to be applied to nonstationary streams of partially labeled data. The aforementioned graph-based method represents the training set as a special graph, referred to as the K-associated graph. The K-associated graph is able to represent similarity relations among data instances, and the purity of a component (connected subgraph) is able to represent the data topology. Purity characterizes the degree to which instances of different classes are mixed in the same region of the data space. In this work we propose a new construction procedure for the K-associated graph that takes partially labeled sets into account. This work also shows how the graph is updated over time to allow data stream processing.
The remainder of the paper is organized as follows. In Sect. 2, we briefly describe the problem of concept drift and give a toy example to illustrate a scenario where incremental learning is applicable. Section 3 presents the proposed method for nonstationary partially labeled stream classification. This section is further divided into four subsections: Sect. 3.1 introduces the K-associated graphs and the K-associated optimal graph; Sect. 3.2 describes the new method for constructing these graphs from partially labeled data sets; Sect. 3.3 briefly treats the static KAOG classifier [4]; and Sect. 3.4 details how the graph is updated over time. Section 4 presents the experimental results comparing the performance of the proposed algorithm with that of three well-known fully supervised streaming ensemble classifiers on nonstationary partially labeled benchmarks. Section 5 concludes the paper and discusses future work.
Background
The nonstationary nature of data streams causes a phenomenon called concept drift (also termed concept substitution, revolutionary change, or population drift) [17, 24, 28, 34], in which concept refers to the data distribution in a given period of time. In the literature, however, the term concept drift has been used in reference to different phenomena related to a drop in classifier accuracy [31]. According to Kelly et al. [17], concept drift occurs due to alterations in the following probabilities governing data production:

A priori probabilities of the classes, P(ω_{1}),…,P(ω_{M}), i.e. an alteration in the relative size of a given class or the appearance of new classes.

Class-conditional probabilities P(x∣ω_{i}), i.e. a change in the class definition. For example, changes in the shape of a class.

A posteriori conditional probabilities P(ω_{i}∣x), i.e. a modification of some of the attributes.
In general terms, concept drift can be characterized according to the variation of the concepts with regard to two main features, velocity and recurrence over time. With respect to the former, a concept drift is either gradual or abrupt; with respect to the latter, a concept drift is recurrent if a past concept becomes the current concept again. Both kinds of concept drift are sketched in Fig. 1.
In Fig. 1, the (blue) rectangles represent the instances that belong to class ω_{1} and the (red) circles represent the instances belonging to class ω_{2}. Consider Figs. 1(a)–(d) as a sequence of data distributions of an application presented over time, starting at t_{0}. The concept drifts that occur between the distributions of Figs. 1(a) and 1(b), as well as between Figs. 1(b) and 1(c), are abrupt. Also notice that the distribution shown in Fig. 1(c) is similar to that in Fig. 1(a), which means that the distribution at time t_{0} in Fig. 1(a) occurs again at time t_{j+1}, after the stream has experienced a completely different distribution (Fig. 1(b)). This phenomenon characterizes a recurrent concept. As the time line from Fig. 1(a) to Fig. 1(d) shows, each distribution can remain static for a given period of time, e.g. the initial distribution remains static from t_{0} to t_{i}. Nonetheless, at the next iteration, t_{i+1}, the distribution can be totally altered, i.e. an abrupt drift occurs. The drift between the distributions in Figs. 1(c) and 1(d) is also considered abrupt, in spite of being less severe than the previous one. Consider now a situation where two groups of data from different classes cross each other over time, depicted in order in Figs. 1(e)–(g), from an initial distribution (Fig. 1(e)) to a final one (Fig. 1(g)), with Fig. 1(f) corresponding to an intermediate distribution. In such a scenario, the distribution varies smoothly over time, which characterizes a gradual drift. Finally, let Fig. 1(h) represent a distribution determined by a hyperplane that rotates over time. If the hyperplane is rotated by π/4 at regular intervals, the drift is gradual, and it is recurrent every eight rotations of the hyperplane. However, if the rotation step is increased, say to π, the drift can now be considered abrupt.
This demonstrates that it is surprisingly difficult to accurately characterize concept drifts considering only velocity and recurrence. In view of this problem, many researchers have proposed different drift categories; for a recent work, refer to Ref. [23].
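The rotating-hyperplane scenario discussed above can be simulated in a few lines. The following is only an illustrative sketch in two dimensions: the function names, the choice of a hyperplane through the origin, and the sample point are assumptions made here, not part of the original experiment.

```python
import math

def hyperplane_normal(step, delta):
    """Unit normal of a 2-D hyperplane through the origin after `step` rotations of `delta` radians."""
    angle = step * delta
    return (math.cos(angle), math.sin(angle))

def label(point, step, delta):
    """Class 1 if the point lies on the non-negative side of the rotated hyperplane, else class 2."""
    nx, ny = hyperplane_normal(step, delta)
    return 1 if point[0] * nx + point[1] * ny >= 0 else 2

point = (1.0, 0.5)  # hypothetical sample point

# Step of pi/4: the label changes only occasionally, and the concept at
# step 8 coincides with the concept at step 0 (recurrence).
gradual = [label(point, t, math.pi / 4) for t in range(9)]

# Step of pi: the hyperplane normal is reversed at every iteration,
# so the label of the point flips each time (abrupt drift).
abrupt = [label(point, t, math.pi) for t in range(3)]
```

With a step of π/4 the sample point keeps its class for several consecutive iterations and returns to its initial class after eight rotations, whereas with a step of π its class flips at every iteration.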
Regardless of how concept drift is characterized, the main concern is that, most of the time, the variation in the underlying data distribution degrades the performance of the classifier in use. The need to replace a classifier due to a drop in accuracy, whether or not it is caused by a concept drift, is called virtual concept drift [18]. The trivial way to treat virtual drift is to replace the low-accuracy classifier by a new one. However, such a strategy has at least three prohibitive drawbacks: (i) retraining new classifiers is usually computationally expensive; (ii) detecting when the current classifier is no longer useful is quite challenging, mainly because natural fluctuations in performance can be confused with real concept drift; (iii) selecting which data should be used to train the new classifier is also a hard task. Fortunately, incremental learning algorithms can provide practical solutions to classification problems in nonstationary domains. Such an approach enables a classifier to acquire knowledge during the application phase, updating the model with new data without explicitly retraining itself [14, 30].
To clarify the advantages of an incremental classifier over a static one in nonstationary domains, consider the following experiment with the artificial data set known as the banana set (see Fig. 2(a)). The experiment compares the static and incremental versions of the classifier based on the K-associated graph. In both cases, the classifier is trained with a limited subset (set “s1” in Fig. 2(a)). The rest of the set is divided into seven groups that are used as test sets and presented sequentially to the classifier. In Fig. 2(a), “s1” is the original training set and the others correspond to the first, second, and seventh test sets, respectively. The results are averaged over 10 runs. At each run, an optimal graph is built from 400 examples (200 of each class) randomly chosen from the training data group. After training, the test examples are presented one by one following the group sequence: 200 examples are randomly chosen from each group (100 of each class), then the next group takes its place, and so on.
Figure 2(b) shows the results of the comparison between the K-associated static and incremental classifiers. The significant difference between them is due to the fact that the static classifier no longer learns from new instances, whereas the incremental classifier is able to learn during the classification phase. The presented incremental learning process is analogous to the linearization technique widely used to study local properties of nonlinear systems. Specifically, linearization in a neighborhood of a certain point corresponds to subset selection in incremental learning, and the nonlinearity of the system corresponds to the twisted shape of the classes and the change of the data distribution over time. In nonlinear systems, linearization usually yields a good approximation if the neighborhood under analysis is small. For the same reason, we expect that good classification results can be obtained by updating the network with a small data subset each time.
The semisupervised K-associated optimal graph
The semisupervised K-associated graph proposed here is a modification of the K-associated graph [4] that deals with both labeled and unlabeled data during the graph construction procedure. Therefore, in order to introduce the semisupervised version, a brief review of the K-associated graph is presented in Sect. 3.1. It is followed by the semisupervised K-associated graph construction, presented in detail in Sect. 3.2. Both the supervised and the semisupervised K-associated optimal graphs can be seen as the training process for the KAOG classifier, which uses the components of the graph and their purities to classify new data instances, as described in Sect. 3.3.
The K-associated graph and the K-associated optimal graph
A K-associated graph is constructed from a vector-based data set X={x_{1},…,x_{N}} by representing each data instance x_{i}=(x_{i1},x_{i2},…,x_{ip},c_{i}) as a vertex v_{i} with its associated class label c_{i}, where c_{i}∈Ω={ω_{1},ω_{2},…,ω_{M}} and M is the number of classes in the problem. The graph construction resembles that of a KNN graph, since it uses a predefined number of neighbors, K, to which each vertex may connect. The K-associated graph differs from the KNN approach, however, in that among the K nearest neighbors of a vertex v_{i}, it can only be connected to those of the same class as v_{i}. Hence, we consider the label-independent and the label-dependent K-neighborhood of vertex v_{i}. The former is simply the set of vertices that represent the K nearest neighbors of the instance x_{i} according to a given distance measure, and is denoted by \(\varLambda_{v_{i},K}\). The latter comprises only the vertices with the same class as v_{i} among its K nearest neighbors, and is defined as \(\varDelta _{v_{i},K} = \{v_{j}\mid v_{j} \in\varLambda_{v_{i},K}\ \mathrm{AND}\ c_{i} = c_{j}\}\).
Formally, the K-associated graph is defined as a directed graph G=(V,E) that consists of a set of labeled vertices V and a set of edges E between them, where an edge e_{ij}=(v_{i},v_{j}) connects vertex v_{i} with vertex v_{j} if and only if \(v_{j} \in \varDelta _{v_{i},K}\). As a consequence, only vertices of the same class can be connected. The resulting K-associated graph can be viewed as a set of disjoint subgraphs or components C={C_{1},…,C_{α},…,C_{R}}. Each component C_{α} is composed of vertices of a single class, so each component represents a single class; we refer to the label of component C_{α} as \(\hat{C}_{\alpha}\). The number of components R varies according to the magnitude of K, but always lies in the range N≥R≥M, with N being the number of vertices in the training set and M the number of classes. Higher values of K induce fewer and larger components in the constructed graph, while lower values lead to small ones. This wiring mechanism leads to a graph with some important features: (i) By varying K, different graphs can be generated, and as the value of K increases, the number of components decreases monotonically towards the number of classes. (ii) The total number of edges among the vertices of a component C_{α} is proportional to K and can be at most equal to KN_{α}, where N_{α} is the number of vertices in component C_{α}. (iii) This maximum value is only achieved if all vertices in the neighborhood of every vertex of the component have the same class. Likewise, nearby vertices of other classes decrease the number of connections of the given component. Thus, one can define a measure of “purity” for components, as explained below.
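The definition above can be sketched in code as follows. This is a minimal illustration, not the authors' implementation: it assumes Euclidean distance, breaks distance ties by index order, and treats components as weakly connected subgraphs; all function names are hypothetical.

```python
import math
from collections import defaultdict

def k_nearest(points, i, k):
    """Indices of the k nearest neighbours of points[i] (Euclidean, excluding i itself)."""
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: math.dist(points[i], points[j]))
    return others[:k]

def k_associated_edges(points, labels, k):
    """Directed edges (i, j) such that j is among the k nearest neighbours of i and shares its class."""
    return [(i, j)
            for i in range(len(points))
            for j in k_nearest(points, i, k)
            if labels[i] == labels[j]]

def find_components(n, edges):
    """Components of the graph, found by breadth-first search ignoring edge direction."""
    adjacency = defaultdict(set)
    for i, j in edges:
        adjacency[i].add(j)
        adjacency[j].add(i)
    seen, result = set(), []
    for start in range(n):
        if start in seen:
            continue
        seen.add(start)
        queue, component = [start], []
        while queue:
            v = queue.pop(0)
            component.append(v)
            for u in sorted(adjacency[v]):
                if u not in seen:
                    seen.add(u)
                    queue.append(u)
        result.append(sorted(component))
    return result

points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
clean = find_components(6, k_associated_edges(points, ["a", "a", "a", "b", "b", "b"], 1))
noisy = find_components(6, k_associated_edges(points, ["a", "a", "b", "b", "b", "b"], 1))
```

In the `noisy` labeling, vertex 2 belongs to class "b" but its nearest neighbour belongs to class "a", so it makes no connection and ends up as an isolated component, which illustrates why the number of components R can exceed the number of classes M.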
Let the degree d_{i} of a vertex v_{i} be defined as the sum of the connections it receives (in-degree) and the connections it makes (out-degree) to other vertices, so \(d_{i} = d_{i}^{\mathrm{in}} + d_{i}^{\mathrm {out}}\). Also, let the average degree of component C_{α} be defined by \(D_{\alpha} = 1/N_{\alpha} \sum_{v_{i} \in C_{\alpha}} d_{i}\). According to the way the K-associated graph is constructed, a vertex can make at most K connections; thus, the maximal total out-degree of component C_{α} is KN_{α}. Symmetrically, the maximal total in-degree is also KN_{α}, so the maximal average degree is 2K. Hence, a key idea is to use the ratio defined in Eq. (1) as a measure of “purity” for component C_{α}, because it quantifies how intertwined a component is with vertices of other classes,
\[\varPhi_{\alpha} = \frac{D_{\alpha}}{2K}. \qquad (1)\]
In this way, Φ_{α}=1 if and only if, for every v_{i} in the component C_{α}, all the K neighbors have the same class label as v_{i}. On the other hand, if there is noise, or if two or more classes are mixed together, vertices in this region are unable to make all their K connections because vertices of other classes are present in their neighborhoods. In the latter case, the more mixed the components are, the lower their average degrees D_{α} and, consequently, their respective purities Φ_{α}.
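As a small worked illustration of the purity measure, the sketch below takes the purity of a component to be its average degree divided by 2K, as the text describes. The function name and the edge-list representation are assumptions made for the example.

```python
def component_purity(component, edges, k):
    """Purity of a component: average degree (in-degree + out-degree) divided by 2K.

    `component` is a list of vertex ids and `edges` a list of directed
    pairs (i, j); only edges with both ends inside the component count.
    """
    members = set(component)
    inner = [(i, j) for i, j in edges if i in members and j in members]
    # Every directed edge adds one out-degree and one in-degree inside the component.
    degree_sum = 2 * len(inner)
    average_degree = degree_sum / len(members)
    return average_degree / (2 * k)

# K = 1, every vertex connects to a same-class nearest neighbour: purity 1.
pure = component_purity([0, 1, 2], [(0, 1), (1, 0), (2, 0)], 1)

# K = 2, but each vertex could make only one of its two possible connections
# because the other neighbour belongs to a different class: purity 0.5.
mixed = component_purity([0, 1], [(0, 1), (1, 0)], 2)
```

The second case shows how neighbours of other classes lower the purity even though the component itself contains a single class.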
Clearly, the structure of a K-associated graph depends on the value of K and on the nature of the input data set. Also, K-associated graphs formed with different values of K will present different components with different purity values. Bearing this in mind, a natural idea is to obtain a graph with the best organization of components without using a single value of K, i.e. each component has its own optimal value of K, denoted by K_{α} for component C_{α}. Therefore, the rationale for obtaining the optimal graph is to construct the K-associated graphs for K=1,…,K_{max} while keeping the best components found at each K throughout this process. Let β also be a component index; then a component \(C_{\beta }^{(K+z)}\) from the (K+z)-associated graph will replace all components from the K-associated graph that satisfy Eq. (2), for some integer z≥1 and (z+K)≤K_{max},
\[\varPhi_{\beta}^{(K+z)} \geq \varPhi_{\alpha}^{(K)} \quad \mbox{for all } C_{\alpha}^{(K)} \subseteq C_{\beta}^{(K+z)}. \qquad (2)\]
The optimal graph improves the representation of the training set and provides the best configuration of components according to their purities. It corresponds to the best graph organization regarding the purity measure.
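The replacement step can be sketched as follows, under the reading that a component from a larger K replaces the components it contains only when its purity is at least as high as that of each of them. The tuple representation and the function name are assumptions made for illustration, and the sketch omits the graph construction itself.

```python
def merge_candidate(optimal, candidate):
    """Try to replace components of the optimal graph by a larger candidate component.

    Each component is a tuple (vertices, purity, k) with `vertices` a frozenset.
    The candidate replaces every optimal component it contains, provided its
    purity is not lower than the purity of any of them (assumed acceptance rule).
    """
    cand_vertices, cand_purity, _ = candidate
    inside = [c for c in optimal if c[0] <= cand_vertices]
    if inside and all(cand_purity >= purity for _, purity, _ in inside):
        return [c for c in optimal if c not in inside] + [candidate]
    return optimal

optimal = [
    (frozenset({0, 1}), 1.0, 1),
    (frozenset({2, 3}), 0.5, 1),
    (frozenset({4}), 0.0, 1),
]

# Accepted: purity 0.6 is not lower than 0.5 or 0.0.
accepted = merge_candidate(optimal, (frozenset({2, 3, 4}), 0.6, 2))

# Rejected: purity 0.7 is lower than the 1.0 of the component {0, 1}.
rejected = merge_candidate(accepted, (frozenset({0, 1, 2, 3, 4}), 0.7, 3))
```

Keeping the value of K alongside each component matters later, because the classifier uses the K at which each component was formed.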
The semisupervised K-associated optimal graph
Consider now obtaining the optimal graph from a partially labeled set X. The construction described previously cannot be applied directly, because it does not say how unlabeled patterns should be connected. Therefore, we propose here the semisupervised construction of the K-associated optimal graph.
The problem addressed here is the absence of enough labeled data in a given data set to employ a regular supervised method. It is therefore necessary to use a semisupervised method in order to induce a classifier from both labeled and unlabeled patterns. Hence, consider the data set X={(x_{1},c_{1}),…,(x_{l},c_{l}),x_{l+1},…,x_{N}} with l labeled patterns (x_{i},c_{i}) and N−l unlabeled patterns x_{j} (or (x_{j},∅)). Like its supervised counterpart, the semisupervised K-associated optimal graph construction involves creating a sequence of semisupervised K-associated graphs. The main difference between the supervised and semisupervised K-associated graphs lies in the set of neighbors to which each vertex connects. Instead of considering only the label-dependent set (\(\varDelta _{v_{i},K}\)), here each vertex v_{i} connects to all vertices in the set \(\varGamma_{v_{i},K} = \{v_{j} \mid v_{j} \in\varLambda_{v_{i},K}\ \mathrm{AND}\ (c_{j} = c_{i}\ \mathrm{OR}\ c_{i} =\emptyset\ \mathrm{OR}\ c_{j} = \emptyset) \}\). This set encompasses the K nearest neighbors of v_{i} whose classes do not conflict with the class of v_{i}. This means that, among its K nearest neighbors, v_{i} connects to those vertices that belong to the same class as v_{i} or that have no label. If v_{i} itself does not have a class label, it connects to all of its K nearest neighbors regardless of their classes.
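The neighbor selection rule for the semisupervised graph can be expressed compactly. In this sketch the unlabeled state ∅ is represented by `None`, and the function name is hypothetical; the list of K nearest neighbors is assumed to have been computed beforehand.

```python
def gamma_neighbours(i, neighbours, labels):
    """Neighbours of vertex i that it may connect to in the semisupervised graph.

    `neighbours` holds the indices of the K nearest neighbours of i, and
    `labels[j]` is the class of vertex j, or None when j is unlabeled.
    A neighbour is kept when the two labels do not conflict: they are equal,
    or at least one of the two vertices is unlabeled.
    """
    ci = labels[i]
    return [j for j in neighbours
            if ci is None or labels[j] is None or labels[j] == ci]

labels = ["a", None, "b", "a"]

# A labeled vertex skips neighbours of a different class.
from_labeled = gamma_neighbours(0, [1, 2, 3], labels)

# An unlabeled vertex connects to all of its nearest neighbours.
from_unlabeled = gamma_neighbours(1, [0, 2, 3], labels)
```

Because the unlabeled vertex 1 connects to vertices of both classes, it can bridge them into one component, which is the situation the splitting procedure has to resolve.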
As a consequence of connecting unlabeled vertices to labeled vertices regardless of their classes, components with more than one class may be formed. However, components with more than one class prevent the classifier from making decisions. In other words, each component must be formed by vertices belonging to a single, non-null class. To overcome this problem, we propose splitting those components whose vertices are associated with two or more classes. The rationale for splitting a component is to cut a few edges in order to end up with separate, well-connected clusters of vertices. In other words, this is a min-cut problem, which can be solved, for example, by the Ford–Fulkerson algorithm [9] in the two-class case. However, as we consider multiclass classification, a component might be composed of vertices from more than two classes. For this reason, we propose cutting the component based on the purity of a vertex, defined as d_{i}/2K, where d_{i} stands for the degree of v_{i}. Also, let W_{i,j} be the distance, used to construct the graph, between patterns x_{i} and x_{j}. The proposed separation approach then consists of successively removing from the component the edge with the minimum value of cut, as defined in Eq. (3). Stated otherwise, the next edge to be removed, (v_{a},v_{b})∈C_{α}, must satisfy cut_{a,b}=min(cut_{i,j}) ∀(v_{i},v_{j})∈C_{α}.
The cutting process in the component C_{α} continues until it is separated into single-class components. The rationale behind the criterion is that by cutting the edges that connect low-purity vertices whose respective patterns are distant from each other, it is more likely that separate, well-connected components are obtained. In fact, in supervised tasks low-purity vertices are usually found in boundary regions between components of different classes. However, in the semisupervised scenario, purity by itself can be a misleading measure due to the high connection probability of the unlabeled vertices. Therefore, the distance weight in Eq. (3) favors cutting the edges with the highest distance in the component.
Algorithm 1 details the construction of the semisupervised K-associated optimal graph. The function findComponents() determines the graph components by implementing a breadth-first search [9]. Then, the components having vertices that belong to more than one class are separated by the function splitNonSingleClassComponents(). This function implements the cutting procedure described earlier and returns two or more single-class components, which can include components without a class label. The next step consists of spreading the labels within every component by calling the function spreadLabel(). After this stage, all vertices in any given component are either labeled with a single class label or unlabeled. To finish the K-associated graph construction, the purity measure is calculated for all components through the function purity(). At the end of this process, if K=1, then the graph generated so far is the optimal graph and it is assigned to \(G^{(\mathrm {opt})}_{s}\). Otherwise, each component of the current K-associated graph, \(G^{(K)}_{s}\), is compared to the components in the graph \(G^{(\mathrm{opt})}_{s}\) having the same vertices (condition in line 20). The new component replaces the corresponding old ones if the purity is increased or maintained for the labeled components. The process goes on by increasing K and generating a new K-associated graph until the number of components in this new graph matches the number of classes in the problem (R=M).
In summary, the main modifications to the original K-associated optimal graph construction algorithm [4] are connecting each vertex to all its neighbors with the same class or without a class label (line 6) and merging every component with an empty class (in Algorithm 1, \(\hat{C}_{\alpha}\) stands for the class of a component) into another component, independently of purity. Notice that the present algorithm not only constructs the K-associated optimal graph but, by doing so, also spreads the labels throughout the whole training set. Therefore, the KAOGSS algorithm is a transductive method.
The KAOG classifier
This section presents the nonparametric classifier that uses the K-associated optimal graph structure to infer the class of new patterns; for more details, please refer to Ref. [4]. In order to present how a new pattern is classified, consider again a training pattern x_{i}=(x_{i1},x_{i2},…,x_{ip},c_{i}), where x_{i} represents the ith training pattern and c_{i} its associated class label, with c_{i}∈Ω={ω_{1},ω_{2},…,ω_{M}} in an M-class problem. In the same way, a new pattern is defined as y=(y_{1},y_{2},…,y_{p}), except that now its class label must be estimated. Consider also the set of components of the optimal graph C={C_{1},…,C_{α},…,C_{R}}, where R is the number of components and R≥M. In order to classify the new pattern y, we must first transform it into a vertex, denoted by v_{y}, and then connect it to the graph as explained below. Let K_{L} be the largest value of K in the K-associated optimal graph, or equivalently the K value of the last component obtained. For every new pattern y, we do the following:

1. Calculate the distances between the new pattern y and all elements x_{i} in the training set.
2. Find the K_{L} nearest neighbors of y, denoted in ascending order of distance as \(\bar{\varLambda}_{v_{y},K_{L}} = \{\mathbf{x}_{(1)},\mathbf {x}_{(2)},\ldots,\mathbf{x}_{(k)},\ldots, \mathbf{x}_{(K_{L})}\}\).
3. For k=1 to K_{L}:
   Locate the vertex (and component) that represents x_{(k)}, say v_{j}∈C_{α}.
   If k≤K_{α}, then connect v_{y} to v_{j}.


Once the new vertex v_{y} is connected to the K-associated graph, its class label is estimated using the Bayes theory [15]. The connections established during classification are temporary, i.e. they are not incorporated into the graph structure. The posterior probability that a new vertex v_{y} belongs to component C_{α}, given the set of label-independent neighbors of v_{y}, denoted by \(\varLambda_{v_{y}}\), is defined by Eq. (4),
\[P(v_{y} \in C_{\alpha}\mid \varLambda_{v_{y}}) = \frac{P(\varLambda_{v_{y}}\mid v_{y} \in C_{\alpha})\, P(v_{y} \in C_{\alpha})}{P(\varLambda_{v_{y}})}. \qquad (4)\]
Since each component C_{α} was formed in a particular K-associated graph among the various graphs generated, we must consider the particular value of K at which C_{α} was formed, denoted by K_{α}. Let \(\varLambda_{v_{y},K_{\alpha}}\) represent the set of K_{α} nearest neighbors of v_{y}. Thus, in order to estimate the probability \(P(\varLambda_{v_{y}}\mid v_{y} \in C_{\alpha })\), one must consider the fraction of the connections made with component C_{α} over all K_{α} possible connections, as shown in Eq. (5),
\[P(\varLambda_{v_{y}}\mid v_{y} \in C_{\alpha}) = \frac{|\varLambda_{v_{y},K_{\alpha}} \cap C_{\alpha}|}{K_{\alpha}}. \qquad (5)\]
The prior probabilities P(v_{y}∈C_{α}) are defined as the normalized purities of the components to which v_{y} is connected, \(P(v_{y} \in C_{\alpha}) = \varPhi_{\alpha} / \sum_{N_{v_{y},C_{\beta}} \neq 0} \varPhi_{\beta}\), where \(N_{v_{y},C_{\beta}}\) represents the number of connections v_{y} has to component C_{β}. Accordingly, the normalizing term is given by Eq. (6),
\[P(\varLambda_{v_{y}}) = \sum_{N_{v_{y},C_{\beta}} \neq 0} P(\varLambda_{v_{y}}\mid v_{y} \in C_{\beta})\, P(v_{y} \in C_{\beta}). \qquad (6)\]
In many cases, there are more components than classes. According to the Bayes optimal classifier, it is then necessary to sum the posterior probabilities of all components corresponding to the same class. Finally, the largest value among the resulting posterior probabilities indicates the most probable class for the new pattern, according to Eq. (7), where φ(y) stands for the class attributed to instance y,
\[\varphi(\mathbf{y}) = \arg\max_{\omega_{i} \in \varOmega} \sum_{\hat{C}_{\beta} = \omega_{i}} P(v_{y} \in C_{\beta}\mid \varLambda_{v_{y}}). \qquad (7)\]
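The classification rule described in this section can be illustrated with a short sketch. It assumes that the number of connections from the new vertex to each component has already been counted, and it follows the textual description: likelihood as the fraction of connections over K_α, prior as the normalized purity, and posteriors summed per class. The function name, dictionary layout, and numeric values are hypothetical.

```python
from collections import defaultdict

def kaog_classify(connections, k_alpha, purity, comp_class):
    """Estimate the class of a new vertex from its connections to components.

    connections[c]: number of connections between the new vertex and component c
    k_alpha[c]:     value of K at which component c was formed
    purity[c]:      purity of component c
    comp_class[c]:  class label of component c
    Returns (most probable class, posterior probability per class).
    """
    linked = [c for c, n in connections.items() if n > 0]
    if not linked:
        raise ValueError("the new vertex is not connected to any component")
    purity_sum = sum(purity[c] for c in linked)
    joint = {}
    for c in linked:
        likelihood = connections[c] / k_alpha[c]   # fraction of the K_alpha possible connections
        prior = purity[c] / purity_sum             # normalized purity
        joint[c] = likelihood * prior
    evidence = sum(joint.values())                 # normalizing term
    per_class = defaultdict(float)
    for c, value in joint.items():
        per_class[comp_class[c]] += value / evidence   # sum posteriors of same-class components
    best = max(per_class, key=per_class.get)
    return best, dict(per_class)

best, posterior = kaog_classify(
    connections={"C1": 2, "C2": 1, "C3": 0},
    k_alpha={"C1": 3, "C2": 2, "C3": 4},
    purity={"C1": 0.9, "C2": 0.6, "C3": 1.0},
    comp_class={"C1": "w1", "C2": "w2", "C3": "w1"},
)
```

In this example the new vertex is connected to C1 and C2 only, so the high-purity component C3 plays no role; the posterior is about 2/3 for class w1 and 1/3 for class w2.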
Classifying partially labeled data stream
This section describes how the proposed graph-based structure copes with nonstationary classification. Consider a stream S={X_{1},Y_{1},…,X_{T},Y_{T}}, where X_{t}={(x_{1},c_{1}),…,(x_{l},c_{l}),x_{l+1},…,x_{N}} contains labeled and unlabeled patterns, while Y_{t}={y_{1},…,y_{M}} is formed of unlabeled patterns only. Such streams may present concept drift at any time. Therefore, an online classifier should be able to evolve by adding new knowledge over time without being retrained. In the proposed approach, this dynamic evolution is achieved by means of a dynamic graph, named the principal graph, which grows with the frequent addition of components provided by the K-associated optimal graph formation (Algorithm 1) as the data stream is processed. Algorithm 2 details the proposed approach.
Algorithm 2 presents the KAOGINCSSL algorithm, which processes a data stream S composed of partially labeled and unlabeled data sets. The function nextChunk(S) removes the next set from stream S and puts it into the variable Z, which represents a chunk of data. After assigning the next set to Z, the algorithm determines whether the set is sufficiently labeled to be used for training/updating (i.e. whether the set has enough labeled patterns, e.g. at least 5 %) through the function isPartiallyLabeled(Z), which returns “true” if Z is partially labeled and “false” otherwise.
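The chunk test can be sketched as below. This is only an illustration of the idea: it assumes a chunk is a list of (features, label) pairs with `None` marking unlabeled patterns, and it uses the 5 % figure mentioned in the text as a default threshold. Both function names are hypothetical Python counterparts, not the authors' code.

```python
def is_partially_labeled(chunk, min_fraction=0.05):
    """True when at least `min_fraction` of the patterns in the chunk carry a label."""
    if not chunk:
        return False
    labeled = sum(1 for _, label in chunk if label is not None)
    return labeled / len(chunk) >= min_fraction

def route(chunk):
    """Decide whether a chunk updates the principal graph or is classified by it."""
    return "update" if is_partially_labeled(chunk) else "classify"

# Two labeled patterns out of twenty: enough to update the graph.
training_chunk = [((0.1, 0.2), "w1"), ((0.9, 0.8), "w2")] + [((0.5, 0.5), None)] * 18

# No labels at all: the chunk is passed to the classifier.
test_chunk = [((0.3, 0.7), None)] * 20
```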
The tasks of the algorithm are therefore twofold: (i) incorporate new knowledge from both labeled and unlabeled patterns to cope with concept drift and (ii) predict the labels of the patterns presented in unlabeled sets. In the former task, the objective is to incorporate new knowledge from the most recently obtained partially labeled set; thus, a semi-supervised K-associated optimal graph is derived using Algorithm 1 (KAOGSS). As explained in Sect. 3.2, the KAOGSS algorithm generates the K-associated optimal graph while spreading the labels to all vertices, and the resulting graph is composed of several disjoint components. These new components are then merged into the principal graph (G_{ P }), which is composed of independent components. However, the addition of new components increases the size of the principal graph, which may increase both classification error and classification time. To avoid this problem, the principal graph should not grow without bound; thus, old and unused components should be removed.
The task of classifying new patterns takes place if the set at hand is unlabeled, and it is carried out by simply applying the KAOG classifier, using the principal graph, to the unlabeled vertices, as presented in Sect. 3.3. Component removal takes place during the classification phase by applying a method named the disuse rule. This rule establishes a maximum number of consecutive classifications in which a component is allowed to remain unused (i.e. to receive no connections during classification). This maximum is set by the parameter τ. When a component remains out of use after τ patterns are classified, it is removed from the principal graph. The algorithm finishes when the whole stream has been processed, i.e. S=∅.
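The disuse rule can be illustrated with a short Python sketch. It is a simplified bookkeeping model, not the authors' code: the class PrincipalGraph and its methods are hypothetical, components are reduced to integer identifiers, and the sketch assumes that a component is dropped as soon as its count of consecutive unused classifications reaches τ.

```python
class PrincipalGraph:
    """Tracks only the disuse counters of the components (illustrative)."""

    def __init__(self, tau):
        self.tau = tau      # maximum consecutive classifications without use
        self.idle = {}      # component id -> consecutive unused classifications
        self._next_id = 0

    def add_components(self, count):
        """Register `count` new components, e.g. after a training chunk."""
        ids = []
        for _ in range(count):
            self.idle[self._next_id] = 0
            ids.append(self._next_id)
            self._next_id += 1
        return ids

    def record_classification(self, used_ids):
        """Update counters after one pattern is classified.

        `used_ids` holds the components that received connections from the
        classified pattern. Returns the ids removed by the disuse rule.
        """
        used = set(used_ids)
        removed = []
        for cid in list(self.idle):
            if cid in used:
                self.idle[cid] = 0          # component was used: reset
            else:
                self.idle[cid] += 1         # one more classification unused
                if self.idle[cid] >= self.tau:
                    del self.idle[cid]      # out of use for tau patterns
                    removed.append(cid)
        return removed
```

Each call touches every component once, which matches the O(N_{ cp }) cost of checking one counter per component.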
An important feature of stream classification algorithms is their ability to process data in reasonable time, which includes the tasks of training, updating and classifying. The proposed algorithm consists of the following phases of data processing: (i) training or updating the principal graph, (ii) classifying new data, and (iii) removing unused components.
In the first phase, training or updating the principal graph is required whenever a partially labeled set X is presented. Let X contain N instances; training (or updating) corresponds to building a semi-supervised K-associated optimal graph (Algorithm 1). As estimated in Ref. [4], the complexity of building a supervised K-associated optimal graph is about O(N^{2}), due to the distance matrix calculation. It has also been shown that the K-associated optimal graph construction scales better than the C4.5 and the Gibbs Sampling algorithms. The only additional processing in the semi-supervised version is the need to verify whether a component contains more than one class, in which case the algorithm cuts some edges to divide it into single-class components. Finding and cutting a component with the proposed technique has cost O(N_{ α }+E_{ α }), where N_{ α } and E_{ α } are the number of vertices and edges in the component C_{ α }, respectively. Since K-associated graphs are sparse, N_{ α } and E_{ α } are much smaller than the number of vertices in the whole graph. Combined with the fact that few components need to be partitioned (only those composed of vertices from more than one class), this means that the computational order of this phase remains O(N^{2}).
In the second phase, the order of classifying a new pattern has also been estimated in the aforementioned work as O(N_{ p }), due to the distance calculation between the new vertex and the N_{ p } vertices in the principal graph. It is important to mention that there exist strategies for lowering this order, for example locating the nearest components first instead of searching directly for the nearest vertices. Such a strategy decreases the computational cost to O(N_{ cp }), with N_{ cp } being the number of components in the principal graph and N_{ cp }≪N_{ p }. Finally, component removal is performed by the disuse rule, which simply checks the time parameter of each component and therefore has order O(N_{ cp }).
Experimental results
The experimental results are obtained on five nonstationary data sets: three of them are generated artificially (SEA [29], Sine and Circles [13]) and the other two are real data (Spam and Elec2 [16]). For all the experiments, Algorithm 1 is used to spread the labels within all the training sets.
In order to simulate a stream of partially labeled data and to evaluate how the algorithms react to different amounts of labeled patterns, we have generated nine experiments for each domain, which differ in the percentage of labeled patterns in the training sets. The percentages of labeled patterns are taken from the set {90 %, 80 %, 70 %, 60 %, 50 %, 40 %, 30 %, 20 %, 10 %}. Each stream is presented as a sequence of chunks of data, alternating between a partially labeled set and a fully unlabeled set. The partially labeled sets are used for training (or updating) the classifiers, while the fully unlabeled sets are used as test sets to estimate the classification accuracy of the algorithms. Here, we use the real labels of the test sets to estimate the classifier accuracy. Among the artificially generated streams, the SEA domain is presented along 500 realizations of training sets with 60 patterns and test sets with 40 patterns. The other two streams, Circle and Sine, are presented along 200 realizations of alternating training and test sets, each of them with 25 patterns. Regarding the two real data sets, we consider a realistic situation in which there are not enough data to set aside for testing. Therefore, the same set is first used for testing and then for training. The Elec2 domain represents the electricity price fluctuation gathered during a given period (for details, please refer to Ref. [13]). The domain is composed of 45,312 patterns, which can be divided into 134 chunks of 336 patterns (except for the first set with 288), each representing a week of price variation. The Spam base is composed of 4601 patterns representing spam and real mail; the chunks, in this case, are defined with 45 patterns, except for the initial one with 101, and the stream is presented along 100 realizations.
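The protocol used for the artificial streams can be sketched in Python as follows. The sketch is an assumption-laden illustration rather than the experimental code: the functions mask_labels and make_stream are hypothetical, labels are hidden uniformly at random, and None marks a hidden label. It covers only the alternating train/test case, not the test-then-train scheme used for the real data sets.

```python
import random


def mask_labels(chunk, labeled_fraction, rng):
    """Keep the labels of a random subset of the chunk and hide the rest."""
    n_keep = round(labeled_fraction * len(chunk))
    keep = set(rng.sample(range(len(chunk)), n_keep))
    return [(x, c if i in keep else None) for i, (x, c) in enumerate(chunk)]


def make_stream(chunks, labeled_fraction, seed=0):
    """Alternate partially labeled training sets and unlabeled test sets.

    `chunks` is a list of fully labeled chunks, each a list of (x, c) pairs.
    Even positions become training sets; odd positions become test sets whose
    true labels are kept aside only for estimating accuracy.
    """
    rng = random.Random(seed)
    stream = []
    for i, chunk in enumerate(chunks):
        if i % 2 == 0:
            stream.append(("train", mask_labels(chunk, labeled_fraction, rng)))
        else:
            patterns = [x for x, _ in chunk]
            true_labels = [c for _, c in chunk]
            stream.append(("test", patterns, true_labels))
    return stream
```

With chunks of 25 patterns and labeled_fraction=0.2, each training set keeps 5 labels, which corresponds to the 20 % setting.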
Regarding the algorithms under comparison, three of them are ensemble algorithms chosen due to their high adaptability. The SEA ensemble [29] consists of a pool of C4.5 classifiers [26] and works by evaluating each of the decision trees, whose outputs are combined into the ensemble output by a simple majority voting scheme. Every time a training batch arrives, a new decision tree is trained and replaces the tree in the ensemble with the largest number of mistakes up to that point. Another algorithm implemented for comparison is the DWM [19], an ensemble method that can be composed of virtually any classifiers. Briefly, the DWM algorithm adds a new incremental classifier to the ensemble every time the ensemble commits an error. Each single classifier has a weight that is decreased by a given factor β every time it commits an error. To control the size of the ensemble, every p iterations those classifiers whose weight is less than a predefined threshold θ are removed. As recommended by the authors, the incremental Naive Bayes (see Ref. [19] for details and references) has been used as base classifier; we therefore denote it DWMNB hereafter. The third algorithm, proposed by Wang et al. [33] and referred to as WCEA in the results below, is also an ensemble that uses a decision tree as base algorithm, similar to SEA, but with weighted classifiers. The weight of each base classifier is estimated by its classification accuracy in a test set. Therefore, the weight of base classifier h_{ k } is given by w_{ k }=MSE_{ r }−MSE_{ i }, where MSE_{ i } corresponds to the generalization error and can be obtained through a cross-validation process, while MSE_{ r } is the estimated error given the new data set and can be calculated as \(\mathrm{MSE}_{r} =\sum_{\omega_{j} \in\varOmega} p(\omega_{j}) (1 - p(\omega_{j}))\), with p(ω_{ j }) the percentage of instances belonging to class ω_{ j }.
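The weighting scheme of the third ensemble can be made concrete with a small Python sketch. It follows the formulas exactly as stated above; the function names mse_random and classifier_weight are hypothetical, and MSE_{ i } is assumed to be supplied by the caller (for example from cross-validation).

```python
from collections import Counter


def mse_random(labels):
    """MSE_r: sum over classes of p(w) * (1 - p(w)), with p(w) the fraction
    of instances in `labels` that belong to class w."""
    n = len(labels)
    counts = Counter(labels)
    return sum((c / n) * (1 - c / n) for c in counts.values())


def classifier_weight(mse_i, labels):
    """w_k = MSE_r - MSE_i for a base classifier with estimated error mse_i."""
    return mse_random(labels) - mse_i
```

For a balanced two-class set, MSE_{ r } is 0.5, so a base classifier with MSE_{ i } of 0.1 receives weight 0.4; a classifier whose error exceeds MSE_{ r } receives a negative weight.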
Figure 3(a) shows the accuracy of the tested algorithms on the nine different experiments, which vary the percentage of labeled patterns in the training sets, for the SEA domain. Each experimental result shows the classification accuracy on the test sets averaged over 20 runs. Figures 3(b)–(d) show the results for the experiment in which the data sets have 20 % of labeled patterns. These results consist of the classification error rates for every presented test set, also averaged over 20 runs. The results of each algorithm under comparison (red curves) and the results of the proposed algorithm (blue curves) are shown together in Figs. 3(b)–(d).
Considering the experimental results displayed in Fig. 3(a), as expected, the performance of all the algorithms tends to degrade as the percentage of labeled patterns in the training sets decreases. Notice that the proposed algorithm KAOGINCSSL and the DWMNB algorithm have performed similarly across all the label percentages, with the exception of the experiment with 10 % of labeled patterns, where the KAOGINCSSL algorithm presented a better performance. In fact, even when only 20 % of the training patterns are labeled, KAOGINCSSL and DWMNB present similar performance, differing in that the proposed algorithm is much more stable, presenting the smallest variance. Regarding the WCEA algorithm, Fig. 3(a) shows that it is the algorithm that suffers the most as the amount of labeled patterns decreases. Again, when considering the 20 % labeled set, in spite of presenting an average error percentage very close to that of the SEA algorithm, the WCEA algorithm presents a larger variation in error rate along the stream processing, as can be seen in Figs. 3(b)–(c). The SEA ensemble has the worst performance in this domain.
Now, consider the experiments on the other two artificial domains (Sine and Circle) and the two real domains (Elec2 and Spam), again each with nine different realizations and with classification results taken as the mean over all presented test sets, averaged over 20 runs. Figure 4 presents the results.
As can be seen in Figs. 4(a)–(b), the proposed algorithm presents a large advantage in accuracy over the other algorithms in all the experiment configurations. This advantage, though, seems to decrease when real data sets are considered, which may be due to the fact that artificial domains are constructed in a controlled manner and therefore present some desired characteristics. Real data sets, on the other hand, present an unorganized scenario, e.g. concept drifts are not as well-defined as in the artificial domains. Bearing this in mind, the advantage presented by the proposed algorithm in the artificial domains can be partially explained by the ability of KAOGINCSSL to discard past concepts much faster than the ensemble algorithms used for comparison. To better analyze the real domains, consider also the standard deviation among the presented test sets, taken from a single run to best represent a real situation. Figure 5 presents the standard deviation for the nine considered experiments and the four compared algorithms when processing the real domains Elec2 and Spam.
In real applications, low variance or standard deviation is a desirable feature for a classifier: the lower the standard deviation, the more reliable the classifier performance. Considering the results for the electricity domain presented in Fig. 4(c), all the algorithms except WCEA present similar performance, especially for low levels of labeling (<40 %). Here, again, the proposed algorithm obtained the best performance in the experiments with more than 50 % of labeled data. Analyzing the standard deviation in Fig. 5(a), it is easy to verify that the proposed algorithm presents the most reliable performance. The DWMNB algorithm presents much higher values of standard deviation, indicating high fluctuation in classification performance, in spite of presenting good average accuracy. The SEA ensemble has good accuracy results and low variance.
Regarding the results of the KAOGINCSSL algorithm on the Spam base shown in Fig. 4(d), at first glance almost the same trend as in the Elec2 domain can be observed: it presents the best average accuracy for the experiments with more than 50 % labeled patterns and average performance for the rest. In spite of that, the KAOGINCSSL algorithm again shows the most regular performance, as depicted in Fig. 5(b). The DWMNB algorithm also performs well, particularly until the percentage of labeled instances falls below 40 %, but with higher standard deviation than KAOGINCSSL. Thus, we can say that the KAOGINCSSL and DWMNB algorithms perform similarly. The SEA ensemble presents the lowest average accuracy but a small standard deviation, while WCEA, despite presenting near-average mean accuracy, shows a very high standard deviation, which discourages the use of both in this domain.
It is also important to notice that all the algorithms that used KAOGSS as transduction algorithm present good results, especially in the real domains. Therefore, we verify that the proposed transduction algorithm KAOGSS can be used not only in association with the KAOGINCSSL algorithm, but also successfully with other algorithms.
Conclusions
This paper has introduced a semi-supervised graph-based algorithm suitable for nonstationary streaming applications, particularly when only a small portion of the acquired data is labeled. Comparative results on artificial and real data sets, obtained with the proposed method and three well-known ensemble methods, show that the proposed algorithm outperformed the compared algorithms in most of the experiments. Moreover, the results show that the presented label spreading technique can be used successfully with other supervised learning algorithms to support semi-supervised classification. Future work includes testing the proposed algorithm with more data sets and comparing it to other algorithms with their own label spreading methods, as well as comparing the accuracy of the optimal graph as a transductive method against other transductive ones.
References
Belkin M, Niyogi P (2003) Laplacian eigenmaps for dimensionality reduction and data representation. Neural Comput 15:1373–1396
Belkin M, Niyogi P, Sindhwani V (2006) Manifold regularization: a geometric framework for learning from labeled and unlabeled examples. J Mach Learn Res 1:1–48
Bertini JR Jr, Lopes A, Motta R, Zhao L (2010) Online classifier based on the optimal K-associated network. In: Proceedings of the joint conference, III international workshop on web and text intelligence (WTI’10), pp 826–835
Bertini JR Jr, Zhao L, Motta R, Lopes A (2011) A nonparametric classification method based on K-associated graphs. Inf Sci 181:5435–5456
Bornholdt S, Schuster H (eds) (2003) Handbook of graphs and networks: from the genome to the Internet, 1st edn. Wiley-VCH, Weinheim
Breve FA, Zhao L, Quiles M, Pedrycz W, Liu J (2011) Particle competition and cooperation in networks for semi-supervised learning. IEEE Trans Knowl Data Eng. doi:10.1109/TKDE.2011.119
Chapelle O, Zien A, Schölkopf B (eds) (2006) Semi-supervised learning, 1st edn. MIT Press, Cambridge
Chapelle O, Sindhwani V, Keerthi S (2008) Optimization techniques for semi-supervised support vector machines. J Mach Learn Res 9:203–233
Cormen T, Leiserson C, Rivest R, Stein C (2009) Introduction to algorithms, 3rd edn. MIT Press, Cambridge
Culp M, Michailidis G (2008) Graph-based semisupervised learning. IEEE Trans Pattern Anal Mach Intell 30(1):174–179
Ditzler G, Polikar R (2011) Semi-supervised learning in nonstationary environments. In: Proceedings of international joint conference on neural networks (IJCNN’11), San Jose, CA, USA. IEEE Press, New York, pp 2741–2748
Erman J, Mahanti A, Arlitt M, Cohen I, Williamson C (2007) Offline/realtime traffic classification using semi-supervised learning. Perform Eval 64:1194–1213
Gama J, Medas P, Castillo G, Rodrigues P (2004) Learning with drift detection. In: Proceedings of the Brazilian symposium on artificial intelligence (SBIA’04), vol 3171. Springer, Berlin, pp 286–295
Giraud-Carrier C (2000) A note on the utility of incremental learning. AI Commun 13(4):215–223
Hastie T, Tibshirani R, Friedman J (2009) The elements of statistical learning: data mining, inference and prediction, 2nd edn. Springer, Berlin
Hettich S, Bay S (1999) The UCI KDD archive. University of California, Irvine, School of Information and Computer Sciences. http://kdd.ics.uci.edu/
Kelly M, Hand D, Adams N (1999) The impact of changing populations on classifier performance. In: Proceedings of the international conference on knowledge discovery and data mining (KDD’99). ACM, New York, pp 367–371
Klinkenberg R, Joachims T (2000) Detecting concept drift with support vector machines. In: Proceedings of the international conference on machine learning (ICML’00). Morgan Kaufmann, San Mateo, pp 487–494
Kolter JZ, Maloof MA (2007) Dynamic weighted majority: an ensemble method for drifting concepts. J Mach Learn Res 8:2755–2790
Li P, Wu X, Hu X (2010) Mining recurring concept drift with limited labeled streaming data. In: JMLR: workshop and conference proceedings, vol 13, pp 241–252
Lopes AA, Bertini JR Jr, Motta R, Zhao L (2009) Classification based on the optimal K-associated network. In: Proceedings of the international conference on complex sciences: theory and applications (COMPLEX’09). Lecture notes of the Institute for Computer Sciences, Social-Informatics and Telecommunications Engineering (LNICST), vol 4. Springer, Berlin, pp 1167–1177
Masud M, Gao J, Khan L, Han J (2008) A practical approach to classify evolving data streams: training with limited amount of labeled data. In: Proceeding of the international conference on data mining (ICDM’08)
Minku L, White A, Yao X (2010) The impact of diversity on online ensemble learning in the presence of concept drift. IEEE Trans Knowl Data Eng 22:730–742
Narasimhamurthy A, Kuncheva L (2007) A framework for generating data to simulate changing environments. In: Proceedings of the international artificial intelligence and applications (ICAIA’07), pp 384–389
Quiles M, Zhao L, Alonso RL, Romero RAF (2008) Particle competition for complex network community detection. Chaos 18:033107
Quinlan JR (1993) C4.5 programs for machine learning, 1st edn. Morgan Kaufmann, San Mateo
Schaeffer S (2007) Graph clustering. Comput Sci Rev 1:27–34
Schlimmer J, Granger R (1986) Beyond incremental processing: tracking concept drift. In: Proceedings of the association for the advancement of artificial intelligence (AAAI’86). AAAI Press, Menlo Park, pp 502–507
Street N, Kim Y (2001) A streaming ensemble algorithm (SEA) for large-scale classification. In: Proc int’l conf knowledge discovery and data mining (KDD’01). ACM, New York, pp 377–382
Sung J, Kim D (2009) Adaptive active appearance model with incremental learning. Pattern Recognit Lett 30:359–367
Syed N, Liu H, Sung K (1999) Handling concept drift in incremental learning with support vector machines. In: Proceedings of the international conference on knowledge discovery and data mining (KDD’99), pp 272–276
von Luxburg U (2007) A tutorial on spectral clustering. Stat Comput 17:395–416
Wang H, Fan W, Yu P, Han J (2003) Mining concept-drifting data streams using ensemble classifiers. In: Proc international conference on knowledge discovery and data mining (KDD’03), pp 226–235
Widmer G, Kubat M (1996) Learning in the presence of concept drift and hidden contexts. Mach Learn 23(1):69–101
Yang C, Zhou J (2008) Nonstationary data sequence classification using online class priors estimation. Pattern Recognit 41:2656–2664
Yu Y, Guo S, Lan S, Ban T (2008) Anomaly intrusion detection for evolving data stream based on semi-supervised learning. In: Proceedings of the international conference on advances in neuro-information processing (NIPS’08), pp 571–578
Zhang P, Zhu X, Guo L (2009) Mining data streams with labeled and unlabeled training examples. In: Proceedings of the ninth IEEE international conference on data mining (ICDM’09). IEEE Press, New York, pp 627–636
Zhu X (2008) Semi-supervised learning literature survey. Tech Rep 1530, Computer Science, University of Wisconsin-Madison
Zhu X (2005) Semi-supervised learning with graphs. Tech Rep Doctoral Thesis, School of Computer Science, Carnegie Mellon University
Acknowledgements
This work is supported by the Brazilian National Research Council (CNPq) and by the São Paulo State Research Foundation (FAPESP).
Additional information
This is a revised and extended version of a previous paper that appeared at WTI 2010 (III International Workshop on Web and Text Intelligence) and has been recommended to JBCS.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License ( https://creativecommons.org/licenses/by/2.0 ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article
Bertini, J.R., Lopes, A.d.A. & Zhao, L. Partially labeled data stream classification with the semisupervised Kassociated graph. J Braz Comput Soc 18, 299–310 (2012). https://doi.org/10.1007/s1317301200728
DOI: https://doi.org/10.1007/s1317301200728
Keywords
Semi-supervised online classification
Incremental learning
Graph-based learning
Concept drift