The semi-supervised K-associated graph, proposed here, consists of a modification of the K-associated graph [4] to deal with both labeled and unlabeled data during the graph construction procedure. Therefore, in order to introduce the semi-supervised version, a brief revision of the K-associated graph is presented in Sect. 3.1. It is followed by the semi-supervised K-associated graph construction presented in detail in Sect. 3.2. Both supervised and semi-supervised K-associated optimal graphs can be seen as the training process for the KAOG classifier which uses the components of the graph and their purities to classify new data instances, as will be exposed in Sect. 3.3.
The K-associated graph and the K-associated optimal graph
A K-associated graph is constructed from a vector-based data set \(X = \{\mathbf{x}_{1},\ldots,\mathbf{x}_{N}\}\) by representing each data instance \(\mathbf{x}_{i} = (x_{i1}, x_{i2}, \ldots, x_{ip}, c_{i})\) as a vertex \(v_{i}\) with its associated class label \(c_{i}\), where \(c_{i} \in \varOmega = \{\omega_{1}, \omega_{2}, \ldots, \omega_{M}\}\) and M is the number of classes in the problem. The graph construction resembles that of a KNN graph, due to the use of a predefined number of neighbors, K, to which each vertex must connect. However, the K-associated graph differs from the KNN approach in that, among the possible K neighbors of a vertex \(v_{i}\), it can only be connected to neighbors of the same class as \(v_{i}\). Hence, we consider the label-independent and the label-dependent K-neighborhood of vertex \(v_{i}\). The former is simply the set of vertices that represents the K nearest neighbors of the instance \(\mathbf{x}_{i}\) according to a given measure and will be denoted by \(\varLambda_{v_{i},K}\). The latter comprises only the vertices with the same class as \(v_{i}\) among its K nearest neighbors, and is defined as \(\varDelta _{v_{i},K} = \{v_{j}\mid v_{j} \in\varLambda_{v_{i},K}\ \mathrm{AND}\ c_{i} = c_{j}\}\).
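To make the two neighborhood sets concrete, the following minimal Python sketch computes \(\varLambda_{v_{i},K}\) and \(\varDelta_{v_{i},K}\) for a vertex, assuming Euclidean distance as the dissimilarity measure; the function names and array-based representation are ours, not part of the original formulation.

```python
import numpy as np

def label_independent_knn(X, i, K):
    """Lambda_{v_i,K}: indices of the K nearest neighbors of x_i (Euclidean)."""
    dist = np.linalg.norm(X - X[i], axis=1)
    dist[i] = np.inf                      # a vertex is not its own neighbor
    return np.argsort(dist)[:K]

def label_dependent_knn(X, labels, i, K):
    """Delta_{v_i,K}: the neighbors in Lambda_{v_i,K} sharing the class of v_i."""
    return [int(j) for j in label_independent_knn(X, i, K) if labels[j] == labels[i]]
```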
In a formal way, the K-associated graph is defined as a directed graph G=(V,E) which consists of a set of labeled vertices V and a set of edges E between them, where an edge \(e_{ij} = (v_{i}, v_{j})\) connects vertex \(v_{i}\) with vertex \(v_{j}\) if and only if \(v_{j} \in \varDelta _{v_{i},K}\). As a consequence, only vertices of the same class can be connected. The resulting K-associated graph can be viewed as a set of disjoint subgraphs or components \(\mathbf{C} = \{C_{1},\ldots,C_{\alpha},\ldots,C_{R}\}\). Each component \(C_{\alpha}\) is composed of vertices of a single class, thus each component represents a single class; we refer to the label of component \(C_{\alpha}\) as \(\hat{C}_{\alpha}\). The number of components R varies according to the magnitude of K, but always lies in the range N≥R≥M, with N being the number of vertices in the training set and M the number of classes. Higher values of K induce fewer and larger components in the constructed graph, while lower values lead to smaller ones. This wiring mechanism leads to a graph with some important features: (i) By varying K, different graphs can be generated, and as the value of K increases, the number of components decreases monotonically to the number of classes. (ii) The total number of edges among the vertices of a component \(C_{\alpha}\) is proportional to K and can be at most equal to \(KN_{\alpha}\), where \(N_{\alpha}\) is the number of vertices in component \(C_{\alpha}\). (iii) This maximum value is only achieved if all vertices in the neighborhood of any vertex of the component have the same class. Likewise, nearby vertices of other classes decrease the number of connections of the given component. Thus, one can define a measure of “purity” for components, as explained ahead.
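As an illustration of this wiring mechanism, the sketch below builds the directed K-associated graph for a fixed K and recovers its components as weakly connected subgraphs. It is a simplified rendering under the same Euclidean-distance assumption; the helper names are ours.

```python
import numpy as np
from collections import deque

def k_associated_graph(X, labels, K):
    """Directed adjacency list: v_i -> v_j for every j in Delta_{v_i,K}."""
    N = len(X)
    edges = {i: [] for i in range(N)}
    for i in range(N):
        dist = np.linalg.norm(X - X[i], axis=1)
        dist[i] = np.inf
        for j in np.argsort(dist)[:K]:
            if labels[j] == labels[i]:        # only same-class connections
                edges[i].append(int(j))
    return edges

def components(edges):
    """Disjoint components of the graph, found by breadth-first search
    while ignoring edge direction."""
    undirected = {i: set(js) for i, js in edges.items()}
    for i, js in edges.items():
        for j in js:
            undirected[j].add(i)
    seen, comps = set(), []
    for start in edges:
        if start in seen:
            continue
        queue, comp = deque([start]), []
        seen.add(start)
        while queue:
            u = queue.popleft()
            comp.append(u)
            for v in undirected[u]:
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
        comps.append(comp)
    return comps
```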
Let the degree \(d_{i}\) of a vertex \(v_{i}\) be defined as the sum of the connections it receives (in-degree) and the connections it performs (out-degree) to other vertices, so \(d_{i} = d_{i}^{\mathrm{in}} + d_{i}^{\mathrm{out}}\). Also, let the average degree of component \(C_{\alpha}\) be defined by \(D_{\alpha} = 1/N_{\alpha} \sum_{v_{i} \in C_{\alpha}} d_{i}\). According to the way the K-associated graph is constructed, a vertex can perform at most K connections; thus, the maximal total out-degree of component \(C_{\alpha}\) is \(KN_{\alpha}\); symmetrically, the maximal total in-degree is also \(KN_{\alpha}\), resulting in an average degree equal to 2K. Hence, a key idea is to use the ratio defined in Eq. (1) as a measure of “purity” for component \(C_{\alpha}\), because it quantifies how intertwined a component is with vertices of other classes,
$$ \varPhi_{\alpha} = \frac{D_{\alpha}}{2K}$$
(1)
In this way, \(\varPhi_{\alpha} = 1\) if and only if, for every \(v_{i}\) in the component \(C_{\alpha}\), all the K neighbors have the same class label as \(v_{i}\). On the other hand, if there exists noise or two or more classes are mixed together, vertices in this region are unable to make their K connections due to the existence of vertices of other classes in their neighborhoods. In the latter case, the more mixed the components are, the lower their average degrees \(D_{\alpha}\) and, consequently, their respective purities \(\varPhi_{\alpha}\).
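A direct reading of Eq. (1) in code, assuming the adjacency-list representation sketched above (a hypothetical `edges` dict), could be:

```python
def component_purity(edges, component, K):
    """Phi_alpha = D_alpha / (2K), where D_alpha is the average vertex degree
    (in-degree plus out-degree) inside component C_alpha -- Eq. (1)."""
    members = set(component)
    degree = {i: 0 for i in members}
    for i in members:
        for j in edges[i]:
            if j in members:      # edge i -> j inside the component
                degree[i] += 1    # out-degree of i
                degree[j] += 1    # in-degree of j
    avg_degree = sum(degree.values()) / len(members)
    return avg_degree / (2 * K)
```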
Clearly, the structure of a K-associated graph depends on the value of K and on the nature of the input data set. Also, K-associated graphs formed with different K will present different components with different purity values. Bearing this in mind, a natural idea is to obtain a graph with the best organization of components without using a single value of K, i.e., each component has its own optimal value of K, denoted as \(K_{\alpha}\) for component \(C_{\alpha}\). Therefore, the rationale for obtaining the optimal graph is to construct the K=1,…,Kmax associated graphs while keeping the best components found at each K throughout this process. Let β also be a component index; then a component \(C_{\beta }^{(K+z)}\) from the (K+z)-associated graph will replace all components from the K-associated graph that satisfy Eq. (2), for some integer z≥1 and (z+K)≤Kmax,
$$ \varPhi^{(K+z)}_{\beta} \ge\varPhi^{(K)}_{\alpha}\quad \mbox{for all}\ C_{\alpha}^{(K)} \subseteq C_{\beta}^{(K+z)}$$
(2)
The optimal graph improves the representation of the training set by providing the best configuration of components according to the purity measure.
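The replacement rule of Eq. (2) can be sketched as follows; components are represented as (vertex set, purity) pairs, and the list-based bookkeeping is our own simplification of the procedure.

```python
def absorb_components(optimal, candidates):
    """Apply the keep-best rule of Eq. (2): a candidate component from the
    (K+z)-associated graph replaces every current component it contains,
    provided its purity is at least as large as each of theirs."""
    current = list(optimal)
    for cand_vertices, cand_purity in candidates:
        absorbed = [comp for comp in current if comp[0] <= cand_vertices]
        if absorbed and all(cand_purity >= purity for _, purity in absorbed):
            current = [comp for comp in current if comp not in absorbed]
            current.append((cand_vertices, cand_purity))
    return current
```

Here `optimal` would be initialized with the components of the 1-associated graph and updated as K increases.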
The semi-supervised K-associated optimal graph
Consider now obtaining the optimal graph from a partially labeled set X. It is easy to see that the aforementioned graph cannot be obtained by the procedure just described, due to the presence of unlabeled patterns. Therefore, we propose here the semi-supervised construction of the K-associated optimal graph.
The problem addressed here regards the absence of enough labeled data in a given data set to employ a regular supervised method. Therefore, it is necessary to consider a semi-supervised method in order to induce a classifier from both labeled and unlabeled patterns. Hence, consider the data set \(X = \{(\mathbf{x}_{1},c_{1}),\ldots,(\mathbf{x}_{l},c_{l}),\mathbf{x}_{l+1},\ldots,\mathbf{x}_{N}\}\) with l labeled patterns \((\mathbf{x}_{i},c_{i})\) and N−l unlabeled patterns \(\mathbf{x}_{j}\) (or \((\mathbf{x}_{j},\emptyset)\)). As with its supervised counterpart, the semi-supervised K-associated optimal graph construction involves creating a sequence of semi-supervised K-associated graphs. The main difference between the supervised and semi-supervised K-associated graphs lies in the set of neighbors to which each vertex connects. Instead of considering only the label-dependent set \(\varDelta _{v_{i},K}\), here each vertex \(v_{i}\) connects to all vertices in the set \(\varGamma_{v_{i},K} = \{v_{j} \mid v_{j} \in\varLambda_{v_{i},K}\ \mathrm{AND}\ (c_{j} = c_{i}\ \mathrm{OR}\ c_{i} =\emptyset\ \mathrm{OR}\ c_{j} = \emptyset) \}\). This set encompasses the K nearest neighbors of \(v_{i}\) whose classes do not conflict with the class of \(v_{i}\). This means that, among its K nearest neighbors, \(v_{i}\) connects to those vertices which belong to the same class as \(v_{i}\) or to those with no label. If \(v_{i}\) itself does not have a class label, it connects to all of its K nearest neighbors regardless of their classes.
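A minimal sketch of the semi-supervised neighborhood \(\varGamma_{v_{i},K}\), assuming unlabeled vertices carry the value `None`:

```python
import numpy as np

def semi_supervised_neighbors(X, labels, i, K):
    """Gamma_{v_i,K}: among the K nearest neighbors of v_i, keep those whose
    class does not conflict with that of v_i (same class, or either unlabeled)."""
    dist = np.linalg.norm(X - X[i], axis=1)
    dist[i] = np.inf
    knn = np.argsort(dist)[:K]
    if labels[i] is None:                       # unlabeled vertex: connect to all
        return [int(j) for j in knn]
    return [int(j) for j in knn
            if labels[j] is None or labels[j] == labels[i]]
```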
As a consequence of connecting unlabeled vertices to labeled vertices regardless of their classes, components with more than one class may be formed. However, having components with more than one class prevents the classifier from making decisions. In other words, each component must be formed by vertices belonging to a single class, different from null. Thus, to overcome this problem, we propose splitting those components whose vertices are associated with two or more classes. For splitting a component, the rationale is to cut a few edges in order to end up with separated, well-connected clusters of vertices. In other words, this is a min-cut problem, which can be resolved, for example, by the Ford–Fulkerson algorithm [9] for the two-class case. However, as we consider multi-class classification, a component might be composed of vertices from more than two classes. For this reason, we propose cutting the component based on the purity of each vertex, defined as \(d_{i}/2K\), where \(d_{i}\) stands for the degree of \(v_{i}\). Again, consider \(W_{i,j}\) the distance, used to construct the graph, between patterns \(\mathbf{x}_{i}\) and \(\mathbf{x}_{j}\). Thus, the proposed separation approach consists of successively removing from the component the edge with the minimum cut value, as defined in Eq. (3). Otherwise stated, the next edge to be removed \((v_{a},v_{b}) \in C_{\alpha}\) must satisfy \(\mathrm{cut}_{a,b} = \min(\mathrm{cut}_{i,j})\ \forall (v_{i},v_{j}) \in C_{\alpha}\).
$$ \mathrm{cut}_{i,j} = \min \biggl(\frac{d_i}{2K},\frac{d_j}{2K} \biggr)\frac{1}{W_{i,j}}$$
(3)
The cutting process in the component \(C_{\alpha}\) proceeds until it is separated into single-class components. The rationale behind the criterion is that, by cutting the edges that connect low-purity vertices and whose respective patterns are distant from each other, it is more likely to obtain separated, well-connected components. In fact, low-purity vertices are usually found in boundary regions between components of different classes in supervised tasks. However, in the semi-supervised scenario, purity by itself can be a misleading measure due to the high connection probability of unlabeled vertices. Therefore, the distance weight in Eq. (3) favors cutting the edges with the largest distances in the component.
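The following sketch implements the successive edge removal driven by Eq. (3). It recomputes the components after every cut with a small breadth-first search, which is simple but not the most efficient bookkeeping; vertex degrees are kept fixed, as in the purity-of-vertex definition above, and all names are ours.

```python
from collections import defaultdict, deque

def split_mixed_component(edge_set, degree, label, W, K):
    """Remove the edge with minimum cut_{i,j} (Eq. (3)) until no resulting
    component contains vertices of two or more (non-null) classes.
    edge_set: undirected pairs (i, j) of the offending component;
    degree[i]: degree of v_i; label[i]: class or None; W[i][j]: distance."""
    edges = set(edge_set)
    vertices = set(degree)

    def current_components():
        adj = defaultdict(set)
        for i, j in edges:
            adj[i].add(j)
            adj[j].add(i)
        seen, comps = set(), []
        for start in vertices:
            if start in seen:
                continue
            queue, comp = deque([start]), set()
            seen.add(start)
            while queue:
                u = queue.popleft()
                comp.add(u)
                for v in adj[u]:
                    if v not in seen:
                        seen.add(v)
                        queue.append(v)
            comps.append(comp)
        return comps

    def is_mixed():
        return any(len({label[v] for v in comp if label[v] is not None}) > 1
                   for comp in current_components())

    while is_mixed():
        i, j = min(edges,
                   key=lambda e: min(degree[e[0]], degree[e[1]]) / (2 * K) / W[e[0]][e[1]])
        edges.remove((i, j))
    return current_components()
```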
Algorithm 1 details the construction of the semi-supervised K-associated optimal graph. The function findComponents() determines the graph components by implementing a breadth-first search [9]. Then, the components having vertices belonging to more than one class are separated by the function splitNonSingleClassComponents(). This function implements the cutting procedure described earlier and returns two or more single class components, which can include components without a class label. The next step consists of spreading the labels within every component by calling the function spreadLabel(). After this stage, all vertices in any given component are labeled with a single class label or are unlabeled. To finish the K-associated graph construction, the purity measure is calculated through the function purity() for all components. At the end of this process, if K=1, then the graph generated so far is the optimal graph and it is assigned to \(G^{(\mathrm {opt})}_{s}\). Otherwise, each component of the current K-associated graph, \(G^{(K)}_{s}\), is compared to the components in the graph, \(G^{(\mathrm{opt})}_{s}\), having the same vertices (condition in line 20). The new component will substitute the corresponding old ones if the purity is increased or maintained for the labeled components. The process goes on by increasing K and generating a new K-associated graph until the number of components in this new graph matches the number of classes in the problem (R=M).
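The label-spreading step is straightforward once every component is single-class; a possible reading of spreadLabel() is sketched below (the in-place update of a `label` dict is our choice).

```python
def spread_label(components, label):
    """Propagate the unique class of each single-class component to its
    unlabeled vertices; components with no labeled vertex stay unlabeled."""
    for comp in components:
        classes = {label[v] for v in comp if label[v] is not None}
        if len(classes) == 1:
            the_class = classes.pop()
            for v in comp:
                label[v] = the_class
    return label
```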
In summary, the main modifications relative to the original K-associated optimal graph construction algorithm [4] include connecting each vertex to all its neighbors with the same class or without a class label (line 6) and merging every component with an empty class (in Algorithm 1, \(\hat{C}_{\alpha}\) stands for the class of a component) into another component, regardless of purity. Notice that the present algorithm not only constructs the K-associated optimal graph but, by doing so, also spreads the labels throughout the whole training set. Therefore, the KAOGSS algorithm is a transductive method.
The KAOG classifier
This section presents the nonparametric classifier that uses the K-associated optimal graph structure to infer the class of new patterns; for more details, please refer to Ref. [4]. In order to present how a new pattern is classified, consider again a training pattern \(\mathbf{x}_{i} = (x_{i1}, x_{i2}, \ldots, x_{ip}, c_{i})\), where \(\mathbf{x}_{i}\) represents the ith training pattern and \(c_{i}\) its associated class label; in an M-class problem, \(c_{i} \in \varOmega = \{\omega_{1}, \omega_{2}, \ldots, \omega_{M}\}\). In the same way, a new pattern is defined as \(\mathbf{y} = (y_{1}, y_{2}, \ldots, y_{p})\), except that now its class label must be estimated. Consider also the set of components of the optimal graph \(\mathbf{C} = \{C_{1},\ldots,C_{\alpha},\ldots,C_{R}\}\), where R is the number of components and R≥M. In order to classify the new pattern y, we must first transform it into a vertex, denoted by \(v_{y}\), and then connect it to the graph as explained ahead. Consider \(K_{L}\) the largest value of K in the K-associated optimal graph, or equivalently the K value of the last obtained component. For every new pattern y, we do:
1. Calculate the distances between the new pattern y and all elements \(\mathbf{x}_{i}\) in the training set.
2. Find the \(K_{L}\) nearest neighbors of y, denoted in ascending order as \(\bar{\varLambda}_{v_{y},K_{L}} = \{\mathbf{x}_{(1)},\mathbf {x}_{(2)},\ldots,\mathbf{x}_{(k)},\ldots, \mathbf{x}_{(K_{L})}\}\).
3. For k=1 to \(K_{L}\), connect \(v_{y}\) to the vertex corresponding to \(\mathbf{x}_{(k)}\).
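Steps 1 and 2 amount to a plain nearest-neighbor query against the training patterns stored in the graph; a minimal sketch (Euclidean distance assumed, names ours):

```python
import numpy as np

def nearest_neighbors_of_new_pattern(X_train, y_new, K_L):
    """Step 1: distances from the new pattern y to every training pattern.
    Step 2: its K_L nearest neighbors, in ascending order of distance."""
    dist = np.linalg.norm(X_train - y_new, axis=1)
    order = np.argsort(dist)[:K_L]
    return order, dist[order]
```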
Once the new vertex \(v_{y}\) is connected to the K-associated graph, its class label is estimated using Bayes theory [15]. The connections established during classification are temporary, i.e. they will not be incorporated into the graph structure. The posterior probability of a new vertex \(v_{y}\) belonging to component \(C_{\alpha}\), given the set of label-independent neighbors of \(v_{y}\), denoted by \(\varLambda_{v_{y}}\), is defined by Eq. (4),
$$ P(v_y \in C_{\alpha}\mid \varLambda_{v_y}) =\frac{P(\varLambda_{v_y}\mid v_y \in C_{\alpha})P(v_y \in C_{\alpha })}{P(\varLambda_{v_y})}$$
(4)
Knowing that each component \(C_{\alpha}\) has been formed in a particular K-associated graph among the various generated graphs, we must consider the particular value of K at which \(C_{\alpha}\) was formed, denoted by \(K_{\alpha}\). Let \(\varLambda_{v_{y},K_{\alpha}}\) represent the set of \(K_{\alpha}\) nearest neighbors of \(v_{y}\). Thus, in order to estimate the probability \(P(\varLambda_{v_{y}}\mid v_{y} \in C_{\alpha })\), one must consider the fraction of connections made with component \(C_{\alpha}\) over all possible \(K_{\alpha}\) connections, as shown in Eq. (5),
$$ P(\varLambda_{v_y}\mid v_y \in C_{\alpha}) =\frac{|\{\varLambda_{v_y,K_{\alpha}}\}|}{K_{\alpha}}$$
(5)
The prior probabilities \(P(v_{y} \in C_{\alpha})\) are defined as the normalized purities among the components to which \(v_{y}\) is connected, i.e. \(P(v_{y} \in C_{\alpha}) = \varPhi_{\alpha} / \sum_{N_{v_{y},C_{\beta}} \neq 0} \varPhi_{\beta}\), where \(N_{v_{y},C_{\beta}}\) represents the number of connections \(v_{y}\) has to component \(C_{\beta}\). Accordingly, the normalizing term is given by Eq. (6),
$$ P(\varLambda_{v_y}) = \sum _{N_{v_y,C_{\beta}} \neq0} P(\varLambda_{v_y}\mid v_y \in C_{\beta})P(v_y \in C_{\beta})$$
(6)
In many cases there are more components than classes; according to the Bayes optimal classifier, it is then necessary to sum the posterior probabilities of all components corresponding to the same class. Finally, the largest value among the resulting posterior probabilities indicates the most probable class for the new pattern, according to Eq. (7), where φ(y) stands for the class attributed to instance y,
$$ \varphi(\mathbf{y} ) = \mathop{\mathrm{arg\,max}} \bigl\{ P (\mathbf{y}\mid \omega_1 ),\dots, P (\mathbf{y}\mid \omega_M )\bigr\}$$
(7)
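Putting Eqs. (4)–(7) together, the decision for a new pattern can be sketched as below. The dictionaries describing the components (number of temporary connections, the K at which each component was formed, purity, and component class) are assumed inputs of our own naming, and the pattern is taken to connect to at least one component.

```python
def classify_new_pattern(connections, K_alpha, purity, comp_class):
    """Posterior of each connected component (Eqs. (4)-(6)), summed per class,
    and the arg-max decision of Eq. (7).

    connections[a]: number of temporary links v_y made to component a
    K_alpha[a]:     K value at which component a was formed
    purity[a]:      Phi_a
    comp_class[a]:  class label of component a
    """
    linked = [a for a, n in connections.items() if n > 0]
    prior_norm = sum(purity[a] for a in linked)
    unnormalized = {}
    for a in linked:
        likelihood = connections[a] / K_alpha[a]       # Eq. (5)
        prior = purity[a] / prior_norm                 # normalized purity
        unnormalized[a] = likelihood * prior
    evidence = sum(unnormalized.values())              # Eq. (6)
    class_posterior = {}
    for a, value in unnormalized.items():
        c = comp_class[a]
        class_posterior[c] = class_posterior.get(c, 0.0) + value / evidence
    return max(class_posterior, key=class_posterior.get)   # Eq. (7)
```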
Classifying partially labeled data stream
This section exposes how the proposed graph-based structure copes with non-stationary classification. Consider a stream \(S = \{X_{1}, Y_{1},\ldots,X_{T}, Y_{T}\}\), where \(X_{t} = \{(\mathbf{x}_{1},c_{1}),\ldots,(\mathbf{x}_{l},c_{l}),\mathbf{x}_{l+1},\ldots,\mathbf{x}_{N}\}\) contains labeled and unlabeled patterns, while \(Y_{t} = \{\mathbf{y}_{1},\ldots,\mathbf{y}_{M}\}\) is formed by unlabeled patterns only. Such streams may present concept drift at any time. Therefore, an online classifier should have the ability to evolve by adding new knowledge over time without being retrained. In the proposed approach, this dynamical evolution is achieved by considering a dynamic graph, named the principal graph, which grows with the frequent addition of components provided by the K-associated optimal graph formation (Algorithm 1) along the data stream processing. Algorithm 2 details the proposed approach.
Algorithm 2 presents the KAOGINCSSL algorithm, which processes a data stream S composed of partially labeled and unlabeled data sets. The function nextChunk(S) removes the next set from stream S and puts it into the variable Z, used to represent a chunk of data. After assigning the next set to the variable Z, the algorithm determines whether the set is partially labeled and should be considered for training/updating (i.e. whether the set has enough labeled patterns, e.g., at least 5 %) through the function isPartiallyLabeled(Z), which returns “true” if Z is partially labeled and “false” otherwise.
Therefore, the tasks of the algorithm are twofold: (i) incorporate new knowledge from both labeled and unlabeled patterns to cope with concept drift and (ii) predict the labels of the patterns presented in unlabeled sets. In the former task, the objective is to incorporate new knowledge from the recently obtained partially labeled set; thus a semi-supervised K-associated optimal graph is derived using Algorithm 1 (KAOGSS). As explained in Sect. 3.2, the KAOGSS algorithm generates the K-associated optimal graph while spreading the labels to all vertices, and the resulting graph is composed of several disjoint components. These new components are then merged into the principal graph (\(G_{P}\)), which is composed of independent components. However, the addition of new components increases the size of the principal graph, which may increase classification error and time. To avoid this problem, the principal graph should not grow unlimitedly; thus, old and unused components should be removed.
The task of classifying new patterns takes place when the set at hand is unlabeled, and it is resolved by simply applying the KAOG classifier with the principal graph to classify the unlabeled vertices, as presented in Sect. 3.3. Component removal takes place during the classification phase by applying a method named the disuse rule. This rule establishes a maximum number of consecutive classifications in which a component is allowed to remain unused (i.e. not receive any connections during classification). The maximum accepted value is set by the parameter τ. When a component remains out of use after τ patterns have been classified, it is removed from the principal graph. The algorithm finishes when the whole stream has been processed, i.e. S=∅.
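A skeletal view of the stream-processing loop of Algorithm 2, with the graph construction (Algorithm 1) and the KAOG classifier passed in as callables, might look as follows; the exact bookkeeping of the disuse rule is our simplification.

```python
def kaog_inc_ssl(stream, build_kaogss, classify, is_partially_labeled, tau):
    """Process a stream of chunks: partially labeled chunks update the principal
    graph; unlabeled chunks are classified, and components unused for more than
    tau consecutive classifications are removed (disuse rule)."""
    principal = []          # list of components forming the principal graph
    idle = {}               # component id -> consecutive classifications unused
    for chunk in stream:
        if is_partially_labeled(chunk):
            for comp in build_kaogss(chunk):      # Algorithm 1 (KAOGSS)
                principal.append(comp)
                idle[id(comp)] = 0
        else:
            for pattern in chunk:
                predicted, used = classify(principal, pattern)
                for comp in principal:
                    idle[id(comp)] = 0 if comp in used else idle[id(comp)] + 1
                principal = [c for c in principal if idle[id(c)] <= tau]
    return principal
```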
An important feature of stream classification algorithms is their ability to process data in a reasonable time, which includes the tasks of training, updating, and classifying. The proposed algorithm consists of the following phases of data processing: (i) training or updating the principal graph, (ii) classifying new data, and (iii) removing unused components.
In the first phase, training or updating the principal graph is required whenever a partially labeled set X is presented. Let there be N instances in the set X; training (or updating) corresponds to building a semi-supervised K-associated optimal graph (Algorithm 1). As estimated in Ref. [4], the complexity order of building a supervised K-associated optimal graph is about \(O(N^{2})\), due to the distance matrix calculation. Also, it has been shown that the K-associated optimal graph construction scales better than the C4.5 and the Gibbs sampling algorithms. The only addition in processing time in the semi-supervised version is the need to verify whether a component presents more than one class and, in this case, to cut some edges to divide it into single-class components. The process of finding and cutting a component with the proposed technique depends on the number of vertices and edges in the component, \(O(N_{\alpha} + E_{\alpha})\), where \(N_{\alpha}\) and \(E_{\alpha}\) are the number of vertices and edges in the component \(C_{\alpha}\), respectively. Since K-associated graphs are sparse, \(N_{\alpha}\) and \(E_{\alpha}\) are much smaller than the number of vertices in the whole graph. Allied to the fact that few components need to be partitioned (those composed of vertices from more than one class), it can be verified that the computational order of this phase remains \(O(N^{2})\).
Considering now the second phase, the order of classifying a new pattern has also been estimated in the aforementioned work as \(O(N_{p})\), due to the distance calculation between the new vertex and the \(N_{p}\) vertices in the principal graph. Here, it is important to mention that there exist strategies for lowering this order, for example, first locating the nearest components instead of actually searching for the neighboring vertices. Such a strategy decreases the computational cost to \(O(N_{cp})\), with \(N_{cp}\) being the number of components in the principal graph and \(N_{cp} \ll N_{p}\). At last, component removal is done by the disuse rule, which simply checks the time parameter of each component; therefore, it has order \(O(N_{cp})\).
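One way to realize the \(O(N_{cp})\) strategy mentioned above is to keep a centroid per component and rank components by the distance from the new pattern to these centroids, restricting the neighbor search to the closest ones; this particular realization is an assumption, not a procedure given in the text.

```python
import numpy as np

def nearest_components(centroids, y_new, m):
    """Rank components by centroid distance to the new pattern and keep the m
    closest, so the neighbor search only visits their vertices."""
    dist = np.linalg.norm(centroids - y_new, axis=1)
    return np.argsort(dist)[:m]
```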