Open Access

Partially labeled data stream classification with the semi-supervised K-associated graph

  • João Roberto BertiniJr.1Email author,
  • Alneu de Andrade Lopes1 and
  • Liang Zhao1
Journal of the Brazilian Computer Society201218:72

https://doi.org/10.1007/s13173-012-0072-8

Received: 15 June 2011

Accepted: 22 March 2012

Published: 17 April 2012

Abstract

Regular data classification techniques are based mainly on two strong assumptions: (1) the existence of a reasonably large labeled set of data to be used in training; and (2) future input data instances conform to the distribution of the training set, i.e. data distribution is stationary along time. However, in the case of data stream classification, both of the aforementioned assumptions are difficult to satisfy. In this paper, we present a graph-based semi-supervised approach that extends the static classifier based on the K-associated Optimal Graph to perform online semi-supervised classification tasks. In order to learn from labeled and unlabeled patterns, here we adapt the optimal graph construction to simultaneously spread the labels in the training set. The sparse, disconnected nature of the proposed graph structure gives flexibility to cope with non-stationary classification. Experimental comparison between the proposed method and three state-of-the-art ensemble classification methods is provided and promising results have been obtained.

Keywords

Semi-supervised online classificationIncremental learningGraph-based learningConcept drift

1 Introduction

Recently, graph-based (also referred to network-based) algorithms applied to data mining tasks have attracted great attention in both theoretical research and practical applications [5]. This growing interest is mostly justified due to the advantages provided by graph representation, such as revealing topological structure of input data and the ability of identifying arbitrary shapes of data clusters [27]. In such graph-based algorithms, each vertex of the graph represents a data pattern (data instance) and the edges stand for some relation of similarity between vertices. In order to reveal significant relations within a data set, the following rule is usually considered for establishing connections between data patterns: the higher the similarity among data, the higher the probability of connection [39]. Stated in this way, nearby patterns tend to be heavily linked together while distant patterns may form a sparse structure. This property has been extensively explored using graph-based solutions, especially considering unsupervised tasks like clustering [32] and dimensionality reduction [1]. Only recently graph-based classification has been addressed, usually by the wrap of semi-supervised learning [38].

Semi-supervised learning methods concern the problem of automatic classification considering data sets with a small number of labeled data and a large amount of unlabeled data [7]. Such approach relies on the fact that labeled data are difficult to be gathered and often are associated with high costs, while unlabeled data are abundant in most applications and generally easy to be collected. Moreover, the manually labeling process is not always reliable or practicable. For example, consider obtaining enough labeled data to train a classifier for a spam detection task (i.e. classifying spam and valid email). Such application design (1) incurs cost in paying an expert or a group of users to label what they call spam from what they consider real email; (2) may result in inconsistencies if we accept all human categorization. For instance, an email message may be considered as a spam by some people, but it may be considered as a valid email by others; (3) not to mention the time required to manually label enough data to train a regular supervised learning method.

A spam detection application is really a stream classification problem, in the sense that the classifier needs to classify new patterns at the time they arrive [35]. In this kind of applications, the underlying data distribution changes over time, and such changes often make the model built on old data inconsistent to the newly arrived data. This problem, known as concept drift [34], requires frequent updating of the model. Summarizing, we have a classification problem which consists of a data stream where few instances are labeled and data distribution may change over time. This scenario poses a challenging task for machine learning because it presents too few labeled data along the stream to apply a supervised incremental algorithm and the presence of concept drift disables the use of static classifiers. In fact, only recently such applications have been properly addressed due to the concept of learning through both labeled and unlabeled data and the development of semi-supervised learning strategies.

In the development of semi-supervised learning algorithms, many efforts have been made on the use of a clustering algorithm to group the patterns and further spread the labels. When considering this approach, the K-means algorithm is a natural choice. Li et al. [20] proposed a tree-based algorithm which uses the K-means to spread labels at the leaves of a tree. Masud et al. [22] proposed an ensemble of micro-clusters, obtained by using the K-means algorithm, then instances are classified according to the K-nearest neighbor rule. Ditzler and Polikar [11] proposed an ensemble of classifiers, named WEA, which are trained with labeled patterns only. Then, unlabeled data and the K-means algorithm are used to generate a mixture of Gaussian models for further adjusting the weights of each classifier. Zhang et al. [37] use the semi-supervised SVM [8] allied to a version of the K-means, referred to as relational K-means, to construct new features to the labeled examples by using information extracted from unlabeled instances. Some investigations have been made to tackle specific problems, e.g. Erman et al. [12] proposed a method to perform traffic classification in computer networks with partially labeled data. Their method uses a clustering algorithm, such as K-means, to obtain the clusters and then, the labels are spread using the maximum likelihood estimation. The clusters that remain unlabeled are likely to be an undefined group. Also regarding computer network, Yu et al. [36] have considered the problem of intrusion detection. They employ a strategy similar to the K-means by grouping the labeled data and then, the labels are spread to the whole data set according to the distances from the clusters to unlabeled patterns. Finally, a SVM is trained to detect intrusion.

To the best of our knowledge, graph-based approach has not been considered to tackle streaming classification problems where data are partially labeled; although it is successfully applied to semi-supervised learning, especially to the transduction problem [2, 6, 10, 25]. In view of the recent developed graph-based nonparametric classification method and its good performance on stationary data sets [4, 21]; we had proposed a non-stationary version with initial results reported in Ref. [3]. In this paper, we propose an extended version to be applied in the context of non-stationary stream of partially labeled data. The aforementioned graph-based method is based on representing the training set as a special graph, referred to as K-associated graph. The K-associated graph is able to represent similarity relations among data instances and the purity of a component (connected subgraph) is able to represent the data topology. Purity characterizes the degree in which instances of different classes are mixed in a same region of the data space. In this work we propose a new constructing procedure for the K-associated graph that takes into account partially labeled sets. Also, this work shows how the graph is updated along the time to allow data stream processing.

The remainder of the paper is organized as follows: In Sect. 2, we briefly describe the problem of concept drift and also a toy example to illustrate a scenario where incremental learning is applicable. Section 3 presents the proposed method for non-stationary partially labeled stream classification. This section is further divided into four subsections, where Sect. 3.1 first introduces the K-associated graphs and the K-associated optimal graph. The new method for constructing the aforementioned graphs from partially labeled data sets is described in Sect. 3.2. Moreover, Sect. 3.3 briefly treats the static KAOG classifier [4] and Sect. 3.4 details how the graph is updated over time. Section 4 presents the experimental results concerning the performance comparison between the proposed algorithm and three well-know fully supervised streaming ensemble classifiers on non-stationary partially labeled benchmarks. Section 5 concludes the paper and discusses some future works.

2 Background

The non-stationary nature of data streams cause a phenomenon called concept drift (or also termed concept substitutions, revolutionary changes, population drift) [17, 24, 28, 34], in which concept refers to the data distribution in a given period of time. In the literature, however, the term concept drift has been used in reference to different phenomena relating to drop down the classifier accuracy performance [31]. According to Kelly et al. [17], concept drift occurs due to alterations on the following probabilities of data production:
  • A priori of classes P(ω1),…,P(ω M ), i.e. alteration on the relative size of a given class or the appearance of new classes.

  • Conditional P(xω i ), i.e., changing on class definition. For example, changes in the shape of a class.

  • Conditional a posterioriP(ω i x), i.e., modification on some of the attributes;

In general terms, concept drift can be characterized according to the variation of the concepts mainly regarding two features, velocity and recurrence along the time. Basically, in the former, a concept drift can be divided into gradual drift and abrupt drift; while in the latter, a concept drift is recurrent if past concept turns to be current concept. Both kinds of concept drift are sketched in Fig. 1.
Fig. 1

Frequent concept drift characterizations, (a)–(d), abrupt drifts with recurrence; (e)–(g) gradual drift and (h) example subjugated to the velocity it is modified

In Fig. 1, the (blue) rectangles represent the instances that belong to class ω1 and the (red) circles represent the instances belonging to class ω2. Consider Figs. 1(a)–(d) as a sequence of data distributions of an application presented in time, initiating at t0. The concept drifts that occur between distributions of Figs. 1(a) and 1(b), as well as between Figs. 1(b) and 1(c), are abrupt. Also notice that the distribution shown by Fig. 1(c) is similar to that in Fig. 1(a), which mean that the distribution at time t0 in Fig. 1(a) occurs again at time tj+1, after experiencing a completely different distribution (Fig. 1(b)). This phenomenon characterizes a recurrent concept. As the time line shows, from Figs. 1(a) and 1(d), each distribution can, eventually, remain static for a given period of time, e.g., the initial distribution remains static from t0 to t i . Nonetheless, on the next iteration ti+1, the distribution can be totally altered, i.e. an abrupt drift occurs. The drift between distributions in Figs. 1(c) and 1(d) is also considered abrupt, in spite of being less severe than the previous one. Consider now a situation where two groups of data from different classes cross each other along time, depicted in order in Figs. 1(e)–(g), from an initial distribution (Fig. 1(e)) to a final one (Fig. 1(g)) with Fig. 1(f) corresponding to an intermediate distribution. In such a scenario, the distribution varies smoothly throughout the time, which characterizes a gradual drift. At last, let Fig. 1(h) represent a distribution determined by a rotating hyperplane along the time. If the hyperplane is rotated by π/4 regularly at a given period of time, the drift is characterized as gradual and recurrent at every eight alterations of the hyperplane. However, if the angular velocity rate is increased, say to π, the drift now can be considered abrupt. This demonstrates that it is surprisingly difficult to accurately characterize concept drifts considering only velocity and recurrence. In view of this problem, many researchers have proposed different drift categories; for a recent work, refer to Ref. [23].

In spite of characterizing concept drift, the main concern is that, most of the time, the variation in the underlying data distributions degenerates the performance of the classifier in use. The need for replacing a classifier due to the drop in accuracy, caused or not by a concept drift, is called virtual concept drift [18]. The trivial way to treat virtual drift is to replace the low accuracy classifier by a new one. However, such strategy brings at least three prohibitive drawbacks, (i) retraining new classifiers usually is computationally expensive; (ii) detecting when the current classifier is no longer useful is quite challenging, mainly due to the natural fluctuations in performance that can be confused with real concept drift; (iii) selecting what data should be used to train the new classifier is also a hard task. Fortunately, incremental learning algorithms can be applied to provide practical solutions to tackle classification problems on non-stationary domains. Such an approach enables a classifier to acquire knowledge during application phase, updating the model with new data, and without explicitly retraining itself [14, 30].

For clarifying the advantages of an incremental classifier over a static one in non-stationary domains, consider the following experiment with the artificial data set known as banana set (see Fig. 2(a)). The experiment consists of comparing both approaches, static and incremental, of the classifier based on the K-associated graph. For doing so, in both cases, the classifier is trained with a limited subset (set “s1” in Fig. 2(a)). The rest of the set is divided into seven groups and used as test sets that are sequentially presented to the classifier. In Fig. 2(a) “s1” is the original training set and the others correspond to the first, second, and seventh test sets, respectively. The results are averaged over 10 runs, at each run, an optimal graph is built considering 400 examples (200 of each class) randomly chosen from the training data group. After training, the test examples are chosen obeying the group sequence, one-by-one, 200 examples are randomly chosen from each group (100 for each class), then, the next group takes place and so on.
Fig. 2

(a) Artificial data set (Banana set) divided into sequential groups for simulating a non-stationary domain. The highlighted examples represent those chosen examples in a run. (b) Cumulated accuracy for static and incremental learning using the KAOG classifier

Figure 2(b) shows the results of the comparison between the K-associated static and incremental classifiers. The significant difference between them is due to the fact that the static classifier no longer learns with new instances, however the incremental classifier is able to learn during classification phase. The presented incremental learning process is analogous to the linearization technique widely used to study local properties of non-linear systems. Specifically, linearization of a neighborhood of a certain point corresponds to subset selection in incremental learning. Non-linearity of the system corresponds to twisted shape of classes and changing of data distribution over time. In a non-linear systems, linearization usually can obtain good approximation if the neighborhood under analysis is small. For the same reason, we expect that good classification results can be obtained by updating the network with small data subset each time.

3 The semi-supervised K-associated optimal graph

The semi-supervised K-associated graph, proposed here, consists of a modification of the K-associated graph [4] to deal with both labeled and unlabeled data during the graph construction procedure. Therefore, in order to introduce the semi-supervised version, a brief revision of the K-associated graph is presented in Sect. 3.1. It is followed by the semi-supervised K-associated graph construction presented in detail in Sect. 3.2. Both supervised and semi-supervised K-associated optimal graphs can be seen as the training process for the KAOG classifier which uses the components of the graph and their purities to classify new data instances, as will be exposed in Sect. 3.3.

3.1 The K-associated graph and the K-associated optimal graph

A K-associated graph is constructed from a vector-based data set X={x1,…,x N } by representing each data instance x i =(xi1,xi2,…,x ip ,c i ) as a vertex v i with its associated class label c i , where c i Ω={ω1,ω2,…,ω M } and M is the number of classes in the problem. The graph construction resembles to a KNN graph, due to the use of a predefined number of neighbors, K, that each vertex must connect. Although the K-associated graph does differ from the KNN approach by the fact that amongst the possible K neighbors of a vertex v i , it can only be connected to neighbors of the same class as v i . Hence, we consider the label-independent and the label-dependent K-neighborhood of vertex v i . The former is simply the set of vertices that represents the K nearest neighbors of the instance x i according to a given measure and will be noted by \(\varLambda_{v_{i},K}\). The latter comprises only the vertices with the same class as v i among its K nearest neighbors, and is defined as \(\varDelta _{v_{i},K} = \{v_{j}\mid v_{j} \in\varLambda_{v_{i},K}\ \mathrm{AND}\ c_{i} = c_{j}\}\).

In a formal way, the K-associated graph is defined as a directed graph G=(V,E) which consists of a set of labeled vertices V and a set of edges E between them, where an edge e ij =(v i ,v j ) connects vertex v i with vertex v j if and only if \(v_{j} \in \varDelta _{v_{i},K}\). As a consequence, only vertices of the same class can be connected. The resulting K-associated graph can be viewed as a set of disjoint subgraphs or componentsC={C1,…,C α ,…,C R }. Each component C α is composed by vertices of a single class, thus each component represents a single class, which we refer to the label of component C α as \(\hat{C}_{\alpha}\). The number of components R varies according to the magnitude of K, but always lies in the range NRM, with N being the number of vertices in the training set and M the number of classes. Higher values of K induce fewer and larger components in the constructed graph, while lower values lead to small sized ones. This wire mechanism leads to a graph with some important features: (i) By varying K, different graphs can be generated, and as the value of K increases, the number of components decreases monotonically to the number of classes. (ii) The total number of edges among the vertices of a component C α is proportional to K and can be at most equal to KN α , where N α is the number of vertices in component C α . (iii) This maximum value is only achieved if all vertices in the neighborhood of any vertex of the component have the same class. Likewise, nearby vertices of other classes decrease the number of connections of the given component. Thus, one can define a measure of “purity” for components, as explained ahead.

Let the degree d i of a vertex v i be defined as the sum of the connections it receives (in-degree) and the connections it performs (out-degree) to other vertices, so \(d_{i} = d_{i}^{\mathrm{in}} + d_{i}^{\mathrm {out}}\). Also, consider the average degree taken for component C α be defined by \(D_{\alpha} = 1/N_{\alpha} \sum_{v_{i} \in C_{\alpha}} d_{i}\). According to the way that the K-associated graph is constructed, a vertex can perform at most K connections, thus, the maximal total out-degree of component C α is KN α ; symmetrically, the total in-degree is also KN α , resulting in average degree being equal to 2K. Hence, a key idea is to use the ratio defined in Eq. (1) as a measure of “purity” for component C α , because it quantifies how intertwined a component is with vertices of other classes,
$$ \varPhi_{\alpha} = \frac{D_{\alpha}}{2K}$$
(1)

In this way, Φ α =1, if and only if, for every v i in the component C α , all the K neighbors have the same class label of v i . On the other hand, if there exists noise or two or more classes are mixed together, vertices in this region are unable to make their K connections due to the existence of vertices of other classes in the neighborhood of some vertices. In the latter case, the more mixing the components are, the lower their average degrees D α and consequently their respective purities Φ α are.

Clearly, the structure of a K-associated graph depends on the value of K and on the nature of the input data set. Also, K-associated graphs formed with different K will present different components with different purity values. Bearing this in mind, a suggestive idea is to obtain a graph with the best organization of components without using a unique value of K, i.e., each component has its own optimal value of K, denoted as K α for component C α . Therefore, the rationale for obtaining the optimal graph is to construct K=1,…,Kmax associated graphs while keeping the best components found at each K throughout this process. Let β also be an index of component, therefore, a component \(C_{\beta }^{(K+z)}\) from the (K+z)-associated graph will replace all components from the K-associated graph that satisfy Eq. (2), for some integer z≥1 and (z+K)≤Kmax,
$$ \varPhi^{(K+z)}_{\beta} \ge\varPhi^{(K)}_{\alpha}\quad \mbox{for all}\ C_{\alpha}^{(K)} \subseteq C_{\beta}^{(K+z)}$$
(2)

The optimal graph improves the representation of the training set and provides the best configuration of components according to their purities. It corresponds to the best graph organization regarding the purity measure.

3.2 The semi-supervised K-associated optimal graph

Consider now obtaining the optimal graph from a partially labeled set X. It is easy to see that it is not possible to obtain the aforementioned graph through the previous description due to the presence of unlabeled patterns. Therefore, we propose here the semi-supervised construction of the K-associated optimal graph.

The problem addressed here regards the absence of enough labeled data in a given data set to employ a regular supervised method. Therefore, it is necessary to consider a semi-supervised method in order to induce a classifier from both labeled and unlabeled patterns. Hence, consider the data set X={(x1,c1),…,(x l ,c l ),xl+1,…,x N } with l labeled patterns (x i ,c i ) and Nl unlabeled patterns x (or (x j ,)). As its supervised counterpart, the semi-supervised K-associated optimal graph construction involves creating a sequence of semi-supervised K-associated graphs. The main difference between the supervised and semi-supervised K-associated graphs can be stated in relation to the set of neighbors, to which each vertex connects. Instead of considering only the label-dependent set (\(\varDelta _{v_{y},K}\)), here, each vertex v i connects to all vertices in the set \(\varGamma_{v_{i},K} = \{v_{j} \mid v_{j} \in\varLambda_{v_{i},K}\ \mathrm{AND}\ (c_{j} = c_{i}\ \mathrm{OR}\ c_{i} =\emptyset\ \mathrm{OR}\ c_{j} = \emptyset) \}\). This set encompasses the K nearest neighbors of v i whose classes are not different from the class of v i . This means that, among its K nearest neighbors, v i connects to those vertices which belong to the same class of v i or to those with no label. If v i itself does not have a class label, it connects to all the K nearest neighbors without considering their classes.

As a consequence of connecting unlabeled vertices to labeled vertices regardless to their classes, components with more than one class may be formed. However, having components with more than one class precludes the classifier to make decisions. In other words, each component must be formed by vertices belonging to a single class and different from null. Thus, to overcome this problem, we propose splitting those components with vertices associated to two or more classes. For splitting a component, the rationale is to cut a few edges in order to end up with separated well-connected clusters of vertices. In other words, this is a min-cut problem, which can be resolved, for example, by the Ford–Fulkerson algorithm [9] for the two-class case. However, as we consider multi-class classification, there exists the problem that a component might be composed by vertices from more than two classes. For this reason, we propose cutting the component based on the purity of vertex, defined as d i /2K, where d i stands for the degree of v i . Again, consider Wi,j the distance, used to construct the graph, between patterns x i and x j . Thus the proposed separation approach consists of successive removing the edge with minimum value of cut from the component, as defined in Eq. (3). Otherwise stated, the next edge to be removed (v a ,v b )C α must satisfy cuta,b=min(cuti,j) (v i ,v j )C α .
$$ \mathrm{cut}_{i,j} = \min \biggl(\frac{d_i}{2K},\frac{d_j}{2K} \biggr)\frac{1}{W_{i,j}}$$
(3)

The cutting process in the component C α finishes until it is separated into single class components. The rationale behind the criterion is that by cutting the edges that connects low purity vertices and whose respective patterns are distant from each other, it is more likely to obtain separated well-connected components. In fact, low purity vertices are usually found in boundary regions between components of different classes in supervised tasks. However, in the semi-supervised scenario, purity itself can be a misleading measure due to high connection probability of the unlabeled vertices. Therefore the distance weight in Eq. (3) favors cutting the edges with highest distance in the component.

Algorithm 1 details the construction of the semi-supervised K-associated optimal graph. The function findComponents() determines the graph components by implementing a breadth-first search [9]. Then, the components having vertices belonging to more than one class are separated by the function splitNonSingleClassComponents(). This function implements the cutting procedure described earlier and returns two or more single class components, which can include components without a class label. The next step consists of spreading the labels within every component by calling the function spreadLabel(). After this stage, all vertices in any given component are labeled with a single class label or are unlabeled. To finish the K-associated graph construction, the purity measure is calculated through the function purity() for all components. At the end of this process, if K=1, then the graph generated so far is the optimal graph and it is assigned to \(G^{(\mathrm {opt})}_{s}\). Otherwise, each component of the current K-associated graph, \(G^{(K)}_{s}\), is compared to the components in the graph, \(G^{(\mathrm{opt})}_{s}\), having the same vertices (condition in line 20). The new component will substitute the corresponding old ones if the purity is increased or maintained for the labeled components. The process goes on by increasing K and generating a new K-associated graph until the number of components in this new graph matches the number of classes in the problem (R=M).
Algorithm 1

Semi-supervised K-associated Optimal Graph construction from partially labeled set—KAOGSS

In summary, the main modifications in the original K-associated optimal graph construction algorithm [4] include connecting each vertex to all its neighbors with the same class or without a class label (line 6) and merging every component with empty class (in the Algorithm 1, \(\hat{C}_{\alpha}\) stands for the class of a component) to another component, independent of purity. Notice that the present algorithm not only can construct the K-associated optimal graph, but also, by doing so, can spread the labels throughout the whole training set. Therefore, the KAOGSS algorithm is a transductive method.

3.3 The KAOG classifier

This section presents the nonparametric classifier that uses the K-associated optimal graph structure to infer the class of new patterns, for more details, please refer to Ref. [4]. In order to present how a new pattern is classified, consider again a training pattern x i represented by x i =(xi1,xi2,…,x ip ,c i ), which x i represents the ith training pattern with c i its associated class label, in a M-class problem c i Ω={ω1,ω2,…,ω M }. In the same way, a new pattern is defined as y=(y1,y2,…,y p ), excepted that now its class label must be estimated. Consider also the set of components of the optimal graph C={C1,…,C α ,…,C R }, where R is the number of components and RM. In order to classify the new pattern y, we must firstly transform it to a vertex, noted by v y , then connect it into the graph as explained ahead. Consider K L the largest value of K in the K-associated optimal graph, or equivalently the K value from the last obtained component. For every new pattern y, we do:
  1. 1.

    Calculate the distances between the new pattern y and all elements x i in the training set

     
  2. 2.

    Find the K L nearest neighbors of y; noted in ascending order as \(\bar{\varLambda}_{v_{y},K_{L}} = \{\mathbf{x}_{(1)},\mathbf {x}_{(2)},\ldots,\mathbf{x}_{(k)},\ldots, \mathbf{x}_{(K_{L})}\}\)

     
  3. 3.
    For k=1 to K L
    • Locate the vertex (and component) that represents x(k), say v j C α

    • If kK α then
      • Connect v y to v j

     
Once the new vertex v y is connected to the K-associated graph, its class label is estimated using the Bayes theory [15]. The connection established during classification are temporary, i.e. they will not be incorporated into the graph structure. The posterior probability of a new vertex v y to belong to component C α given the set of label-independent neighbors of v y , noted by \(\varLambda_{v_{y}}\), is defined by Eq. (4),
$$ P(v_y \in C_{\alpha}\mid \varLambda_{v_y}) =\frac{P(\varLambda_{v_y}\mid v_y \in C_{\alpha})P(v_y \in C_{\alpha })}{P(\varLambda_{v_y})}$$
(4)
Knowing that each component C α has been formed in a particular K-associated graph among the various generated graphs, we must consider the particular value of K in which C α was formed, noted by K α . Let \(\varLambda_{v_{y},K_{\alpha}}\) represent the set of K α nearest neighbors of v y . Thus, in order to estimate the probability \(P(\varLambda_{v_{y}}\mid v_{y} \in C_{\alpha })\), one must consider the fraction among the connections made with component C α over all possible K α connections, as shown in Eq. (5),
$$ P(\varLambda_{v_y}\mid v_y \in C_{\alpha}) =\frac{|\{\varLambda_{v_y,K_{\alpha}}\}|}{K_{\alpha}}$$
(5)
The prior probabilities P(v y C α ) are defined as the normalized purities among the components to which v y is connected as \(P(v_{y} \in C_{\alpha}) = \varPhi_{\alpha} / \sum_{N_{v_{y},C_{\beta}} \neq 0} \varPhi_{\beta}\), where \(N_{v_{y},C_{\beta}}\) represents the number of connections v y has to component C β . Accordingly, the normalizing term is given by Eq. (6),
$$ P(\varLambda_{v_y}) = \sum _{N_{v_y,C_{\beta}} \neq0} P(\varLambda_{v_y}\mid v_y \in C_{\beta})P(v_y \in C_{\beta})$$
(6)
In many cases, there are more components than number of classes, according to Bayes optimal classifier, it is necessary to sum the posterior probabilities of all components corresponding to the same class. Finally the largest value among the found posterior probabilities reflects the most probable class for the new pattern, according to Eq. (7), where φ(y) stands for the class attributed for instance y,
$$ \varphi(\mathbf{y} ) = \mathop{\mathrm{arg\,max}} \bigl\{ P (\mathbf{y}\mid \omega_1 ),\dots, P (\mathbf{y}\mid \omega_M )\bigr\}$$
(7)

3.4 Classifying partially labeled data stream

This section exposes how the proposed graph-based structure copes with non-stationary classification. Consider a stream S={X1,Y1,…,X T ,Y T }, where X t ={(x1,c1),…,(x l ,c l ),xl+1,…,x N } contains labeled and unlabeled patterns; while Y t ={y1,…,y M } is formed with unlabeled patterns only. Such streams may present concept drift at any time. Therefore, an online classifier should have the ability to evolve by adding new knowledge along time without being retrained. In the proposed approach, this dynamical evolution is done by considering a dynamic graph, named principal graph, which grows with the frequent addition of components provided by the K-associated optimal graph formation (Algorithm 1) along the data stream processing. Algorithm 2 details the proposed approach.
Algorithm 2

Incremental Algorithm KAOGINCSSL

Algorithm 2 presents the KAOGINCSSL algorithm, which processes a data stream S composed of partially labeled and unlabeled data sets. The function nextChunk(S) removes the next set from stream S and put it into the variable Z used to represent a chunk of data. After assigning the next set to the variable Z, the algorithm determines if the set is partially labeled to be considered for training/updating (i.e. if the set has enough labeled patterns, e.g., at least 5 %) through the function isPartiallyLabeled(Z) which returns “true” if Z is partially labeled and “false” otherwise.

Therefore, the tasks of the algorithm are twofold, (i) incorporate new knowledge from both labeled and unlabeled patterns to subdue concept drift and (ii) predict the label for the unlabeled patterns presented in unlabeled sets. In the former task, the objective is to incorporate new knowledge from the recent obtained partially labeled set, thus a semi-supervised K-associated optimal graph is derived using Algorithm 1 (KAOGSS). As explained in Sect. 3.2, the KAOGSS algorithm generates the K-associated optimal graph spreading the labels to all vertices and the resulting graph is composed of several disjoint components. These new components are then merged to the principal graph (G P ), which is composed of independent components. However, the addition of new components increases the size of the principal graph, which may increment classification error and time. To avoid this problem, the principal graph should not grow unlimitedly, thus, old and unused components should be removed.

The task of classifying new patterns takes place if the set at hand is unlabeled, and it is resolved by simply applying the KAOG classifier using the principal graph to classify unlabeled vertices, as presented in Sect. 3.3. Component removal takes place during classification phase by applying a method named disuse rule. This rule establishes a maximum number of consecutive classifications in which a component is allowed to be unused (i.e. do not receive any connections during classification). The maximum value accepted is set by the parameter τ. When a component remains out of use after τ patterns are classified, it is removed from the principal graph. The algorithm finishes when the whole stream has been processed, i.e. S=.

An important feature of stream classification algorithms is its ability to process data in a reasonable time, which includes the tasks of training, updating and classifying. The proposed algorithm consists of the following phases of data processing: (i) training or updating the principal graph, (ii) classifying new data and, (iii) removing unused components.

In the first phase, training or updating the principal graph is required whenever a partially labeled set X is presented. Let there be N instances in the set X; training (or updating) corresponds to build a semi-supervised K-associated optimal graph (Algorithm 1). As estimated in Ref. [4], the complexity order to build a supervised K-associated optimal graph is about O(N2)—due to distance matrix calculation. Also, it has been shown that the K-associated optimal graph construction scales better than the C4.5 and the Gibbs Sampling algorithms. Taking into account that the only addition in processing time in the semi-supervised version is the need to verify whether a component presents more than one class and, in this case, the algorithm cuts out some edges to divide it into some single class components. Knowing that the process of finding and cutting a component by using the proposed technique depends on the number of edges and vertices in the component (O(N α +E α )), where N α and E α are the number of vertices and edges in the component C α , respectively. Since K-associated graphs are sparse, thus, N α or E α is much smaller than the number of vertices in the whole graph. Allied to the fact that few components need to be partitioned (those components, which are composed of vertices from more than one class), it can be verified that the computational order of this phase remains O(N2).

Now we consider the second phase, the order of classifying a new pattern has also been estimated in the aforementioned work as O(N p ), due to the distance calculation among the new vertex and the N p vertices in the principal graph. Here, it is important to mention that there exist strategies for lowering the order, for example locating the nearest components firstly, instead of actually searching for the vertices neighbors. Such strategy decreases the computational cost to O(N cp ), with N cp being the number of components in the principal graph, and N cp N p . At last, component removal can be done by the disuse rule, which is done by simply checking the time parameter of each component, therefore, it has the order of O(N cp ).

4 Experimental results

The experimental results are obtained considering five non-stationary data sets, with three of them generated artificially, SEA [29], Sine and Circles [13] and the other two are real data, Spam and Elec2 [16]. For all the experiments, Algorithm 1 is used to spread the label to all the training sets.

In order to simulate a stream of partially labeled data and qualify how the algorithms react with different amount of labeled patterns, we have generated nine experiments for each domain, differing from each other regarding to the percentage of labeled patterns in the training sets. With the percentages of labeled patterns lying in the set {90 %, 80 %, 70 %, 60 %, 50 %, 40 %, 30 %, 20 %, 10 %}. Each stream is presented as a sequence of chunks of data, alternating between a partially labeled set and a fully unlabeled set. The partially labeled sets are used for training (or updating) the classifiers, while the fully unlabeled sets are used as test sets to estimate the classification accuracy of the algorithms. Here, we use the real labels of the test sets to estimate the classifier accuracy. Among the artificially generated stream data, the SEA domain is presented along 500 realizations of training sets with 60 patterns and tests set with 40 patterns. The other two streams, Circle and Sine, are presented along 200 realizations of alternating training and test sets, each of them with 25 patterns. Regarding the two real data sets, we consider a real situation where there are not enough data to use for testing. Therefore the same set is firstly used for testing and then for training. The Elec2 domain represents the electricity price fluctuation gathered during a given period (for details, please refer to Ref. [13]). The domain is composed of 45,312 patterns, which can be divided into 134 chunks of 336 patterns (except for the first set with 288), representing a week of price variation. The spam base is composed of 4601 patterns representing spam and real mail, the chunks, in this case, are defined with 45 patterns except for the initial with 101, and the stream is presented along 100 realizations.

Regarding the algorithms under comparison, three of them are ensemble algorithms chosen due to their high adaptability. The SEA ensemble [29] consists of a pool of C4.5 classifiers [26] and works by evaluating each of the decision trees, whose output is used to decide the ensemble output by a simple majority voting scheme. Every time a training batch arrives, a new decision tree is trained and it replaces the tree in the ensemble with the major number of mistakes up to that point. Another algorithm implemented for comparison is the DWM [19], which consists of an ensemble method that virtually can be composed by any classifiers. Briefly, the DWM algorithm adds a new incremental classifier to the ensemble every time an error is committed by the ensemble. Each single classifier has a weight that is decreased by a determined factor β every time it commits an error. For controlling the size of the ensemble, at every p iteration, those classifiers whose weight is less than a predefined threshold θ are removed. As recommended by the authors, the incremental Naive Bayes (see Ref. [19] for details and references) has been used as base classifier, therefore we note DWM-NB hereafter. The third algorithm, proposed by Wang et al. [33], is also an ensemble that uses a decision tree as base algorithm, similar to SEA, but with weighted classifiers. The weight of each base classifier is estimated by its classification accuracy in a test set. Therefore, the weight of base classifier h k is given by w k =MSE r −MSE i , where MSE i corresponds to the generalization error and can be obtained through a cross-validation process; while MSE r is the estimated error given the new data set, and can be calculated as \(\mathrm{MSE}_{r} =\sum_{\omega_{j} \in\varOmega} p(\omega_{j}) (1 - p(\omega_{j}))\), with p(ω j ) the percentage of instances belonging to class ω j .

Figure 3(a) shows the accuracy for the tested algorithms on the nine different experiments regarding the percentage of labeled pattern in the training sets for the SEA domain. Each experiment result shows the classification accuracy on a test set averaged by 20 runs. Figures 3(b)–(d) show the results for the experiment with data sets with 20 % of labeled patterns. The results consist of the classification error rates for every presented test set, also averaged by 20 runs. The results of each algorithm under comparison (red curves) and the results of the proposed algorithm (blue curves) are put together and shown in Figs. 3(b)–(d).
Fig. 3

Experiments with the SEA domain, (a) average accuracy for nine experiments with different percentages of labeled patterns, from 10 % to 90 %; and performance comparison through the SEA presentation with 20 % labeled patterns, between the KAOGINCSSL and (b) SEA, (c) WCEA and (d) DWM-NB

Considering the experimental results displayed in Fig. 3(a), as expected, all the algorithms tend to degenerate their performance as the labeling percentage provided in the training sets decays. Notice that the proposed algorithm KAOGINCSSL and the DWM-NB algorithm have performed similarly throughout all the different label percentages domains, with exception to the experiment with 10 % of labeled patterns where the KAOGINCSSL algorithm presented a better performance. In fact, even when only 20 % of the training patterns are labeled, KAOGINCSSL and DWM-NB present similar performance, differentiating by the fact that the proposed algorithm is much more stable, presenting the smallest variance. Regarding the WCEA algorithm, from Fig. 3(a), we see that it is the algorithm that suffers the most as the amount of labeled patterns decreases. Again, when considering 20 % labeled set, in spite of presenting very close result for the average error percentage to the SEA algorithm performance, the WCEA algorithm presents a larger variation on error rate along the stream processing, as can be seem in Figs. 3(b)–(c). The SEA ensemble has the worst performance in this domain.

Now, consider the experiments on the other two artificial domains (Sine and Circle) and the two real domains (Elec2 and Spam), again, each with nine different realizations and classification results taken as the mean over all presented test sets averaged by 20 runs. Figure 4 presents the results.
Fig. 4

Experiments with nine different percentages of labeled patterns in each data set for the domains Sine, Circle, Elec2 and Spam

As can be seen in Figs. 4(a)–(b), the proposed algorithm presents a large advantage on accuracy performance over the other algorithms, considering all the experiment configurations. This advantage, though, seems to decrease when real data sets are considered, which may be due to the fact that artificial domains are constructed in a controlled manner and therefore present some desired characteristics. On the other hand, real data sets present an unorganized scenario, e.g. concept drifts are not well-defined as in the artificial domains. Bearing this in mind, the advantage presented by the proposed algorithm in the artificial domains can be partially explained due to the KAOGINCSSL ability to get rid of past concepts much faster than the ensemble algorithms used for comparison. Now, for better analyzing the real domains, consider also the standard deviation among the presented test sets and taken from a single run to best represent a real situation. Figure 5 presents the standard deviation for the nine considered experiments and the four compared algorithms when processing the real domains Elec2 and Spam.
Fig. 5

Standard deviations for the four compared algorithms when processing the real domains (Elec2 and Spam) within the nine different percentage of labeled patterns in each data set

In real applications, low variance or standard deviation is a desirable feature for a classifier, precisely the lower the standard deviation the more reliable is the classifier performance. Therefore, considering the results for the electricity domain presented in Fig. 4(c); except for the WCEA algorithm, all the others have presented a similar performance, in special for low levels of labeling (<40 %). Here, again the proposed algorithm obtained the best performances for the experiments with more than 50 % of labeled data. Analyzing the standard deviation in Fig. 5(a), it is easy to verify that the proposed algorithm presents the most reliable performance. The DWM-NB algorithm presents too higher values of standard deviation indicating high fluctuation in classification performance, in spite of presenting good average accuracy. The SEA ensemble has good accuracy results and low variance.

Regarding the results of the KAOGINCSSL algorithm in the Spam base shown in Fig. 4(d), at a first glance, almost the same trend as in the Elec2 domain can be observed. Because it has presented best average accuracy performance for experiments with more than 50 % labeled patterns and average performance for the rest. In spite of that, the KAOGINCSSL algorithm shows again the most regular performance as depicted in Fig. 5(b). The DWM-NB algorithm has also performs well, particularly up to the point where labeled data instances fall off from least than 40 %, but with higher standard deviation than the KAOGINCSSL. Thus, we can say that both KAOGINCSSL and DWM-NB algorithms perform similarly. The SEA ensemble presents the lowest average accuracy but small standard deviation, while the WCEA instead of presenting near average mean accuracy also shows too high standard deviation, which discourages both to be used in this domain.

It is also important to notice that all the algorithms, which have used the KAOGSS as transduction algorithm, present good results, especially in the real domains. Therefore, we verify that the proposed transduction algorithm KAOGSS, not only can be used in association to the KAOGINCSSL algorithm, but also can be successfully used in other algorithms as well.

5 Conclusions

This paper has introduced a semi-supervised graph-based algorithm suitable for non-stationary streaming application, particularly when only a small portion of the acquired data presents label. Comparative results on artificial and real data sets performed on the proposed method against three well-know ensemble methods show that the proposed algorithm outperformed the compared algorithms in most of the experiments. Moreover, the results show that the present spreading label technique can be used successfully in other supervised learning algorithms to support semi-supervised classification. Future work includes testing the proposed algorithm with more data sets and comparing to other algorithms with their own spreading label method, as well as comparing the accuracy of the optimal graph as a transductive method against other transductive ones.

Notes

Declarations

Acknowledgements

This work is supported by the Brazilian National Research Council (CNPq) and by the São Paulo State Research Foundation (FAPESP).

Authors’ Affiliations

(1)
Instituto de Ciências Matemáticas e de Computação, USP

References

  1. Belkin M, Niyogi P (2003) Laplacian eigenmaps for dimensionality reduction and data representation. Neural Comput 15:1373–1396MATHView ArticleGoogle Scholar
  2. Belkin M, Niyogi P, Sindhwani V (2006) Manifold regularization: a geometric framework for learning from labeled and unlabeled examples. J Mach Learn Res 1:1–48MathSciNetGoogle Scholar
  3. Bertini JR Jr, Lopes A, Motta R, Zhao L (2010) Online classifier based on the optimal K-associated network. In: Proceedings of the joint conference, III international workshop on web and text intelligence (WTI’10), pp 826–835Google Scholar
  4. Bertini JR Jr, Zhao L, Motta R, Lopes A (2011) A nonparametric classification method based on K-associated graphs. Inf Sci 181:5435–5456MathSciNetView ArticleGoogle Scholar
  5. Bornholdt S, Schuster H (eds) (2003) Handbook of graphs and networks: from the genome to the Internet, 1st edn. Wiley-VCH, WeinheimGoogle Scholar
  6. Breve FA, Zhao L, Quiles M, Pedrycz W, Liu J (2011) Particle competition and cooperation in networks for semi-supervised learning. IEEE Trans Knowl Data Eng. doi:10.1109/TKDE.2011.119Google Scholar
  7. Chapelle O, Zien A, Schölkopf B (eds) (2006) Semi-supervised learning, 1st edn. MIT Press, CambridgeGoogle Scholar
  8. Chapelle O, Sindhwani V, Keerthi S (2008) Optimization techniques for semi-supervised support vector machines. J Mach Learn Res 9:203–233MATHGoogle Scholar
  9. Cormen T, Leiserson C, Rivest R, Stein C (2009) Introduction to algorithms, 3rd edn. MIT Press, CambridgeMATHGoogle Scholar
  10. Culp M, Michailidis G (2008) Graph-based semisupervised learning. IEEE Trans Pattern Anal Mach Intell 30(1):174–179View ArticleGoogle Scholar
  11. Ditzler G, Polikar R (2011) Semi-supervised learning in nonstationary environments. In: Proceedings of international joint conference on neural networks (IJCNN’11), San Jose, CA, USA. IEEE Press, New York, pp 2741–2748View ArticleGoogle Scholar
  12. Erman J, Mahanti A, Arlitt M, Cohen I, Williamson C (2007) Offline/realtime traffic classification using semi-supervised learning. Perform Eval 64:1194–1213View ArticleGoogle Scholar
  13. Gama J, Medas P, Castillo G, Rodrigues P (2004) Learning with drift detection. In: Proceedings of the Brazilian symposium on artificial intelligence (SBIA’04), vol 3171. Springer, Berlin, pp 286–295Google Scholar
  14. Giraud-Carrier C (2000) A note on the utility of incremental learning. AI Commun 13(4):215–223MATHGoogle Scholar
  15. Hastie T, Tibshirani R, Friedman J (2009) The elements of statistical learning: data mining, inference and prediction, 2nd edn. Springer, BerlinView ArticleGoogle Scholar
  16. Hettich S, Bay S (1999) The UCI KDD archive. University of California, Irvine, School of Information and Computer Sciences. http://kdd.ics.uci.edu/
  17. Kelly M, Hand D, Adams N (1999) The impact of changing populations on classifier performance. In: Proceedings of the international conference on knowledge discovery and data mining (KDD’99). ACM, New York, pp 367–371Google Scholar
  18. Klinkenberg R, Joachims T (2000) Detecting concept drift with support vector machines. In: Proceedings of the international conference on machine learning (ICML’00). Morgan Kaufmann, San Mateo, pp 487–494Google Scholar
  19. Kolter JZ, Maloof MA (2007) Dynamic weighted majority: an ensemble method for drifting concepts. J Mach Learn Res 8:2755–2790MATHGoogle Scholar
  20. Li P, Wu X, Hu X (2010) Mining recurring concept drift with limited labeled streaming data. In: JLMR: workshop and conference proceedings, vol 13, pp 241–252Google Scholar
  21. Lopes AA, Bertini JR Jr, Motta R, Zhao L (2009) Classification based on the optimal k-associated network. In: Proceedings of the international conference on complex sciences: theory and applications (COMPLEX’09). Lecture notes of the Institute for Computer Sciences, Social-Informatics and Telecommunications Engineering (LNICST), vol 4. Springer, Berlin, pp 1167–1177Google Scholar
  22. Masud M, Gao J, Khan L, Han J (2008) A practical approach to classify evolving data streams: training with limited amount of labeled data. In: Proceeding of the international conference on data mining (ICDM’08)Google Scholar
  23. Minku L, White A, Yao X (2010) The impact of diversity on online ensemble learning in the presence of concept drift. IEEE Trans Knowl Data Eng 22:730–742View ArticleGoogle Scholar
  24. Narasimhamurthy A, Kuncheva L (2007) A framework for generating data to simulate changing environments. In: Proceedings of the international artificial intelligence and applications (ICAIA’07), pp 384–389Google Scholar
  25. Quiles M, Zhao L, Alonso RL, Romero RAF (2008) Particle competition for complex network community detection. Chaos 18:033107MathSciNetView ArticleGoogle Scholar
  26. Quinlan JR (1993) C4.5 programs for machine learning, 1st edn. Morgan Kaufmann, San MateoGoogle Scholar
  27. Schaeffer S (2007) Graph clustering. Comput Sci Rev 1:27–34MATHView ArticleGoogle Scholar
  28. Schlimmer J, Granger R (1986) Beyond incremental processing: tracking concept drift. In: Proceedings of the association for the advancement of artificial intelligence (AAAI’86). AAAI Press, Menlo Park, pp 502–507Google Scholar
  29. Street N, Kim Y (2001) A streaming ensemble algorithm (SEA) for large-scale classification. In: Proc int’l conf knowledge discovery and data mining (KDD’01). ACM, New York, pp 377–382Google Scholar
  30. Sung J, Kim D (2009) Adaptive active appearance model with incremental learning. Pattern Recognit Lett 30:359–367View ArticleGoogle Scholar
  31. Syed N, Liu H, Sung K (1999) Handling concept drift in incremental learning with support vector machines. In: Proceedings of the international conference on knowledge discovery and data mining (KDD’99), pp 272–276Google Scholar
  32. von Luxburg U (2007) A tutorial on spectral clustering. Stat Comput 17:395–416MathSciNetView ArticleGoogle Scholar
  33. Wang H, Fan W, Yu P, Han J (2003) Mining concept-drifting data streams using ensemble classifiers. In: Proc international conference on knowledge discovery and data mining (KDD’03), pp 226–235Google Scholar
  34. Widmer G, Kubat M (1996) Learning in the presence of concept drift and hidden contexts. Mach Learn 23(1):69–101Google Scholar
  35. Yang C, Zhou J (2008) Non-stationary data sequence classification using online class priors estimation. Pattern Recognit 41:2656–2664MATHView ArticleGoogle Scholar
  36. Yu Y, Guo S, Lan S, Ban T (2008) Anomaly intrusion detection for evolving data stream based on semi-supervised learning. In: Proceedings of the international conference on advances in neuro-information processing (NIPS’08), pp 571–578Google Scholar
  37. Zhang P, Zhu X, Guo L (2009) Mining data streams with labeled and unlabeled training examples. In: Proceedings of the ninth IEEE international conference on data mining (ICDM’09). IEEE Press, New York, pp 627–636View ArticleGoogle Scholar
  38. Zhu X (2008) Semi-supervised learning literature survey. Tech Rep 1530, Computer-Science, University of Wisconsin-MadisonGoogle Scholar
  39. Zhu X (2005) Semi-supervised learning with graphs. Tech Rep Doctoral Thesis, School of Computer Science, Carnegie Mellon UniversityGoogle Scholar

Copyright

© The Brazilian Computer Society 2012