The semisupervised K-associated graph, proposed here, modifies the K-associated graph [4] so that both labeled and unlabeled data are handled during graph construction. To introduce the semisupervised version, Sect. 3.1 briefly reviews the K-associated graph, and Sect. 3.2 presents the semisupervised K-associated graph construction in detail. Both the supervised and the semisupervised K-associated optimal graphs can be seen as the training process for the KAOG classifier, which uses the components of the graph and their purities to classify new data instances, as described in Sect. 3.3.
The K-associated graph and the K-associated optimal graph
A K-associated graph is constructed from a vector-based data set X={x_{1},…,x_{N}} by representing each data instance x_{i}=(x_{i1},x_{i2},…,x_{ip},c_{i}) as a vertex v_{i} with its associated class label c_{i}, where c_{i}∈Ω={ω_{1},ω_{2},…,ω_{M}} and M is the number of classes in the problem. The construction resembles that of a KNN graph in that it uses a predefined number of neighbors, K, to which each vertex may connect. It differs from the KNN approach in that, among the K nearest neighbors of a vertex v_{i}, v_{i} connects only to those of the same class as v_{i}. Hence, we consider the label-independent and the label-dependent K-neighborhoods of vertex v_{i}. The former is simply the set of vertices that represent the K nearest neighbors of the instance x_{i} according to a given measure, and is denoted by \(\varLambda_{v_{i},K}\). The latter comprises only the vertices with the same class as v_{i} among its K nearest neighbors, and is defined as \(\varDelta _{v_{i},K} = \{v_{j}\mid v_{j} \in\varLambda_{v_{i},K}\ \mathrm{AND}\ c_{i} = c_{j}\}\).
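The two neighborhoods can be illustrated with a minimal Python sketch. It assumes Euclidean distance and breaks distance ties by vertex index; both choices, as well as the function names `k_nearest` and `label_dependent`, are illustrative assumptions and are not prescribed by the text.

```python
import math

def k_nearest(points, i, k):
    """Label-independent neighborhood: indices of the k points nearest to points[i].

    Assumes Euclidean distance; ties are broken by index (an arbitrary choice).
    """
    ranked = sorted(
        (math.dist(points[i], points[j]), j)
        for j in range(len(points)) if j != i
    )
    return [j for _, j in ranked[:k]]

def label_dependent(points, labels, i, k):
    """Label-dependent neighborhood: the k nearest neighbors that share the class of i."""
    return [j for j in k_nearest(points, i, k) if labels[j] == labels[i]]

# Toy data: four points on a line, two classes.
points = [(0.0,), (1.0,), (2.0,), (3.0,)]
labels = ["a", "a", "b", "b"]
print(k_nearest(points, 1, 2))                # [0, 2]
print(label_dependent(points, labels, 1, 2))  # [0]
```

In this toy case vertex 1 has two nearest neighbors, but only one of them shares its class, so only that one would receive an edge.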
Formally, the K-associated graph is defined as a directed graph G=(V,E) consisting of a set of labeled vertices V and a set of edges E between them, where an edge e_{ij}=(v_{i},v_{j}) connects vertex v_{i} to vertex v_{j} if and only if \(v_{j} \in \varDelta _{v_{i},K}\). As a consequence, only vertices of the same class can be connected. The resulting K-associated graph can be viewed as a set of disjoint subgraphs, or components, C={C_{1},…,C_{α},…,C_{R}}. Each component C_{α} is composed of vertices of a single class, so each component represents a single class; we denote the label of component C_{α} by \(\hat{C}_{\alpha}\). The number of components R varies with K, but always lies in the range N≥R≥M, with N being the number of vertices in the training set and M the number of classes. Higher values of K induce fewer and larger components in the constructed graph, while lower values lead to more and smaller ones. This wiring mechanism leads to a graph with some important features: (i) By varying K, different graphs can be generated, and as K increases, the number of components decreases monotonically toward the number of classes. (ii) The total number of edges among the vertices of a component C_{α} is proportional to K and can be at most KN_{α}, where N_{α} is the number of vertices in component C_{α}. (iii) This maximum is achieved only if all vertices in the neighborhood of every vertex of the component have the same class. Conversely, nearby vertices of other classes decrease the number of connections in the component. Thus, one can define a measure of “purity” for components, as explained next.
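As a rough illustration of the construction just described, the sketch below builds the directed edges and extracts the components with a breadth-first search over the undirected view of the graph. Euclidean distance, index-based tie breaking, and the function names are assumptions made for this example.

```python
import math
from collections import deque

def k_associated_edges(points, labels, k):
    """Directed edges (i, j): j is among the k nearest neighbors of i and has the same class."""
    n = len(points)
    edges = set()
    for i in range(n):
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: (math.dist(points[i], points[j]), j))
        for j in order[:k]:
            if labels[j] == labels[i]:
                edges.add((i, j))
    return edges

def components(n, edges):
    """Components of the graph, ignoring edge direction, found by breadth-first search."""
    adj = {v: set() for v in range(n)}
    for i, j in edges:
        adj[i].add(j)
        adj[j].add(i)
    seen, comps = set(), []
    for start in range(n):
        if start in seen:
            continue
        seen.add(start)
        queue, comp = deque([start]), []
        while queue:
            v = queue.popleft()
            comp.append(v)
            for w in adj[v] - seen:
                seen.add(w)
                queue.append(w)
        comps.append(sorted(comp))
    return comps

points = [(0.0,), (1.0,), (3.0,), (4.0,), (20.0,), (21.0,)]
labels = ["a", "a", "a", "a", "b", "b"]
for k in (1, 2):
    print(k, components(len(points), k_associated_edges(points, labels, k)))
```

On this toy set, K=1 yields three components and K=2 merges the two class-a pairs into one, which matches the behavior described in feature (i).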
Let the degree d_{i} of a vertex v_{i} be the sum of the connections it receives (in-degree) and the connections it makes (out-degree), so \(d_{i} = d_{i}^{\mathrm{in}} + d_{i}^{\mathrm {out}}\). Also, let the average degree of component C_{α} be \(D_{\alpha} = 1/N_{\alpha} \sum_{v_{i} \in C_{\alpha}} d_{i}\). By construction, a vertex of the K-associated graph can make at most K connections; thus, the maximal total out-degree of component C_{α} is KN_{α}. Since every edge that leaves a vertex of the component also enters a vertex of the same component, the maximal total in-degree is also KN_{α}, which gives a maximal average degree of 2K. Hence, a key idea is to use the ratio defined in Eq. (1) as a measure of “purity” for component C_{α}, because it quantifies how intertwined a component is with vertices of other classes,
$$ \varPhi_{\alpha} = \frac{D_{\alpha}}{2K}$$
(1)
In this way, Φ_{α}=1 if and only if, for every v_{i} in the component C_{α}, all K nearest neighbors have the same class label as v_{i}. On the other hand, if there is noise, or if two or more classes are mixed together, vertices in the affected region are unable to make all their K connections, because some of their neighbors belong to other classes. In that case, the more mixed the components are, the lower their average degrees D_{α} and, consequently, their purities Φ_{α}.
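The purity of Eq. (1) can be computed directly from the directed edge set. The following sketch assumes that edges are stored as (i, j) pairs and that a component is given as a collection of vertex indices; the function name and the toy edge set are illustrative.

```python
def purity(component, edges, k):
    """Purity of a component: average degree (in-degree plus out-degree) divided by 2K."""
    members = set(component)
    # Every directed edge inside the component contributes one out-degree
    # and one in-degree, hence two units of total degree.
    internal = sum(1 for i, j in edges if i in members and j in members)
    average_degree = 2 * internal / len(members)
    return average_degree / (2 * k)

# Edges of a toy 2-associated graph: vertices 0-3 form a pure component,
# while vertices 4 and 5 each lose one connection to a neighbor of another class.
edges = {(0, 1), (0, 2), (1, 0), (1, 2), (2, 3), (2, 1), (3, 2), (3, 1),
         (4, 5), (5, 4)}
print(purity([0, 1, 2, 3], edges, k=2))  # 1.0
print(purity([4, 5], edges, k=2))        # 0.5
```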
Clearly, the structure of a K-associated graph depends on the value of K and on the nature of the input data set. Also, K-associated graphs formed with different values of K present different components with different purity values. Bearing this in mind, the idea is to obtain a graph with the best organization of components without using a single value of K; i.e., each component has its own optimal value of K, denoted by K_{α} for component C_{α}. The optimal graph is therefore obtained by constructing the K-associated graphs for K=1,…,K_{max} while keeping the best components found at each K throughout this process. Let β also be a component index. A component \(C_{\beta }^{(K+z)}\) from the (K+z)-associated graph replaces the components from the K-associated graph that it contains whenever Eq. (2) holds, for some integer z≥1 with (z+K)≤K_{max},
$$ \varPhi^{(K+z)}_{\beta} \ge\varPhi^{(K)}_{\alpha}\quad \mbox{for all}\ C_{\alpha}^{(K)} \subseteq C_{\beta}^{(K+z)}$$
(2)
The optimal graph improves the representation of the training set and provides the best configuration of components according to their purities. It corresponds to the best graph organization regarding the purity measure.
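The replacement rule of Eq. (2) can be sketched as follows. The representation of a component as a `(vertices, purity, k)` tuple and the function name `update_optimal` are assumptions made for this illustration, not the data structures of Ref. [4].

```python
def update_optimal(optimal, candidates):
    """Apply the rule of Eq. (2) to a list of stored components.

    Both arguments are lists of (vertices, purity, k) tuples, with vertices a frozenset.
    A candidate replaces the stored components it contains only if its purity
    is at least as high as the purity of each of them.
    """
    result = list(optimal)
    for vertices, phi, k in candidates:
        contained = [c for c in result if c[0] <= vertices]
        if contained and all(phi >= c[1] for c in contained):
            result = [c for c in result if c not in contained]
            result.append((vertices, phi, k))
    return result

# Components kept from the 1-associated graph of a toy data set.
optimal = [(frozenset({0, 1}), 1.0, 1),
           (frozenset({2, 3}), 1.0, 1),
           (frozenset({4, 5}), 1.0, 1)]
# Components of the 2-associated graph of the same data.
candidates = [(frozenset({0, 1, 2, 3}), 1.0, 2),
              (frozenset({4, 5}), 0.5, 2)]
for component in update_optimal(optimal, candidates):
    print(sorted(component[0]), component[1], component[2])
```

Here the merged component {0, 1, 2, 3} keeps purity 1 and replaces its two parts, while {4, 5} keeps its K=1 version because its purity drops at K=2.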
The semisupervised K-associated optimal graph
Consider now obtaining the optimal graph from a partially labeled set X. The construction described above cannot be applied directly, because the label-dependent neighborhood is undefined for unlabeled patterns. Therefore, we propose here the semisupervised construction of the K-associated optimal graph.
The problem addressed here is the lack of enough labeled data in a given data set to employ a regular supervised method. It is therefore necessary to use a semisupervised method in order to induce a classifier from both labeled and unlabeled patterns. Consider the data set X={(x_{1},c_{1}),…,(x_{l},c_{l}),x_{l+1},…,x_{N}} with l labeled patterns (x_{i},c_{i}) and N−l unlabeled patterns x_{j} (or (x_{j},∅)). Like its supervised counterpart, the semisupervised K-associated optimal graph construction involves creating a sequence of semisupervised K-associated graphs. The main difference between the supervised and the semisupervised K-associated graphs lies in the set of neighbors to which each vertex connects. Instead of considering only the label-dependent set (\(\varDelta _{v_{i},K}\)), here each vertex v_{i} connects to all vertices in the set \(\varGamma_{v_{i},K} = \{v_{j} \mid v_{j} \in\varLambda_{v_{i},K}\ \mathrm{AND}\ (c_{j} = c_{i}\ \mathrm{OR}\ c_{i} =\emptyset\ \mathrm{OR}\ c_{j} = \emptyset) \}\). This set encompasses the K nearest neighbors of v_{i} whose classes do not differ from the class of v_{i}. That is, among its K nearest neighbors, v_{i} connects to those vertices that belong to the same class as v_{i} or that have no label. If v_{i} itself has no class label, it connects to all its K nearest neighbors regardless of their classes.
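A minimal sketch of the semisupervised neighborhood follows. It assumes that unlabeled vertices carry the label `None`, that the distance is Euclidean, and that ties are broken by index; the function name is illustrative.

```python
import math

def gamma_neighborhood(points, labels, i, k):
    """K nearest neighbors of i whose class does not conflict with the class of i.

    A neighbor is kept if it has the same class as i, or if either vertex is
    unlabeled (label None).
    """
    order = sorted((j for j in range(len(points)) if j != i),
                   key=lambda j: (math.dist(points[i], points[j]), j))
    return [j for j in order[:k]
            if labels[i] is None or labels[j] is None or labels[j] == labels[i]]

points = [(0.0,), (1.0,), (2.0,), (3.0,)]
labels = ["a", None, "b", "b"]
print(gamma_neighborhood(points, labels, 0, 2))  # [1]
print(gamma_neighborhood(points, labels, 1, 2))  # [0, 2]
print(gamma_neighborhood(points, labels, 2, 2))  # [1, 3]
```

In this toy case the unlabeled vertex 1 links to both a class-a and a class-b vertex, so all four vertices end up in one component that contains two classes.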
As a consequence of connecting unlabeled vertices to labeled vertices regardless of their classes, components with more than one class may be formed. However, a component with more than one class prevents the classifier from making decisions. In other words, each component must be formed by vertices belonging to a single, non-null class. To overcome this problem, we propose splitting the components whose vertices are associated with two or more classes. To split a component, the rationale is to cut a few edges so as to end up with separate, well-connected clusters of vertices. This is a min-cut problem, which can be solved, for example, by the Ford–Fulkerson algorithm [9] in the two-class case. However, since we consider multiclass classification, a component may contain vertices from more than two classes. For this reason, we propose cutting the component based on the purity of a vertex, defined as d_{i}/2K, where d_{i} is the degree of v_{i}. Also, let W_{i,j} be the distance, used to construct the graph, between patterns x_{i} and x_{j}. The proposed separation approach then consists of successively removing from the component the edge with the minimum cut value, as defined in Eq. (3). In other words, the next edge to be removed, (v_{a},v_{b})∈C_{α}, must satisfy cut_{a,b}=min(cut_{i,j}) ∀(v_{i},v_{j})∈C_{α}.
$$ \mathrm{cut}_{i,j} = \min \biggl(\frac{d_i}{2K},\frac{d_j}{2K} \biggr)\frac{1}{W_{i,j}}$$
(3)
The cutting process in the component C_{α} continues until it is separated into single-class components. The rationale behind the criterion is that cutting the edges that connect low-purity vertices whose respective patterns are distant from each other makes it more likely to obtain separate, well-connected components. In fact, low-purity vertices are usually found in boundary regions between components of different classes in supervised tasks. However, in the semisupervised scenario, purity alone can be misleading, because unlabeled vertices connect with high probability. Therefore, the distance weight 1/W_{i,j} in Eq. (3) favors cutting the longest edges in the component.
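The cutting procedure can be sketched as below. Several details are assumptions of this example and are not fixed by the text: degrees are recomputed after every removal, ties between equal cut values are broken by the edge tuple, distances are Euclidean and strictly positive, and unlabeled vertices carry the label `None`.

```python
import math
from collections import deque

def split_component(vertices, edges, labels, points, k):
    """Remove minimum-cut edges (Eq. 3) until every piece holds at most one class."""
    edges = set(edges)

    def pieces():
        # Connected pieces of the component, ignoring edge direction.
        adj = {v: set() for v in vertices}
        for i, j in edges:
            adj[i].add(j)
            adj[j].add(i)
        seen, out = set(), []
        for start in vertices:
            if start in seen:
                continue
            seen.add(start)
            queue, comp = deque([start]), set()
            while queue:
                v = queue.popleft()
                comp.add(v)
                for w in adj[v] - seen:
                    seen.add(w)
                    queue.append(w)
            out.append(comp)
        return out

    def mixed(comp):
        return len({labels[v] for v in comp if labels[v] is not None}) > 1

    while True:
        parts = pieces()
        bad = [c for c in parts if mixed(c)]
        if not bad:
            return parts
        degree = {v: 0 for v in vertices}
        for a, b in edges:
            degree[a] += 1  # out-degree of a
            degree[b] += 1  # in-degree of b

        def cut(edge):
            i, j = edge
            vertex_purity = min(degree[i], degree[j]) / (2 * k)
            return vertex_purity / math.dist(points[i], points[j])

        candidates = [e for e in edges if e[0] in bad[0]]
        edges.remove(min(candidates, key=lambda e: (cut(e), e)))

# Two clusters bridged by unlabeled vertices (edges of a toy semisupervised 3-associated graph).
points = [(0.0,), (1.0,), (2.0,), (6.0,), (7.0,), (8.0,)]
labels = ["a", "a", None, None, "b", "b"]
edges = {(0, 1), (0, 2), (0, 3), (1, 0), (1, 2), (1, 3), (2, 1), (2, 0), (2, 3),
         (3, 4), (3, 5), (3, 2), (4, 3), (4, 5), (4, 2), (5, 4), (5, 3), (5, 2)}
print(split_component(list(range(6)), edges, labels, points, k=3))
```

In this example only the six long edges that bridge the two clusters are removed, leaving one piece per class.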
Algorithm 1 details the construction of the semisupervised K-associated optimal graph. The function findComponents() determines the graph components by means of a breadth-first search [9]. Then, the components whose vertices belong to more than one class are separated by the function splitNonSingleClassComponents(). This function implements the cutting procedure described earlier and returns two or more single-class components, which can include components without a class label. The next step consists of spreading the labels within every component by calling the function spreadLabel(). After this stage, all vertices in any given component either share a single class label or are unlabeled. To finish the K-associated graph construction, the purity measure is calculated for all components through the function purity(). At the end of this process, if K=1, the graph generated so far is the optimal graph and is assigned to \(G^{(\mathrm {opt})}_{s}\). Otherwise, each component of the current K-associated graph, \(G^{(K)}_{s}\), is compared with the components of \(G^{(\mathrm{opt})}_{s}\) that contain the same vertices (condition in line 20). The new component replaces the corresponding old ones if the purity is increased or maintained for the labeled components. The process continues by increasing K and generating a new K-associated graph until the number of components in the new graph matches the number of classes in the problem (R=M).
In summary, the main modifications to the original K-associated optimal graph construction algorithm [4] are connecting each vertex to all its neighbors with the same class or without a class label (line 6), and merging every component with an empty class into another component, regardless of purity (in Algorithm 1, \(\hat{C}_{\alpha}\) stands for the class of a component). Notice that the present algorithm not only constructs the K-associated optimal graph but, in doing so, also spreads the labels throughout the whole training set. Therefore, the KAOGSS algorithm is a transductive method.
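The label-spreading step inside a single component might look like the sketch below. It is only a plausible reading of what spreadLabel() does, assuming unlabeled vertices carry `None` and that the component has already been split so that it holds at most one class.

```python
def spread_label(component, labels):
    """Give every vertex of a single-class component the class found in that component.

    labels is modified in place; a component without any labeled vertex stays unlabeled.
    """
    classes = {labels[v] for v in component if labels[v] is not None}
    if len(classes) > 1:
        raise ValueError("component must be split before labels are spread")
    if classes:
        (label,) = classes
        for v in component:
            labels[v] = label
    return labels

labels = ["a", None, None, None, "b", "b"]
spread_label([0, 1, 2], labels)
spread_label([3, 4, 5], labels)
print(labels)  # ['a', 'a', 'a', 'b', 'b', 'b']
```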
The KAOG classifier
This section presents the nonparametric classifier that uses the K-associated optimal graph structure to infer the class of new patterns; for more details, please refer to Ref. [4]. To present how a new pattern is classified, consider again a training pattern x_{i}=(x_{i1},x_{i2},…,x_{ip},c_{i}), where x_{i} is the ith training pattern and c_{i} its associated class label, with c_{i}∈Ω={ω_{1},ω_{2},…,ω_{M}} in an M-class problem. In the same way, a new pattern is defined as y=(y_{1},y_{2},…,y_{p}), except that now its class label must be estimated. Consider also the set of components of the optimal graph C={C_{1},…,C_{α},…,C_{R}}, where R is the number of components and R≥M. To classify the new pattern y, we first represent it as a vertex, denoted by v_{y}, and then connect it to the graph as explained next. Let K_{L} be the largest value of K in the K-associated optimal graph or, equivalently, the K value of the last component obtained. For every new pattern y, we do:

1. Calculate the distances between the new pattern y and all elements x_{i} in the training set

2. Find the K_{L} nearest neighbors of y; noted in ascending order as \(\bar{\varLambda}_{v_{y},K_{L}} = \{\mathbf{x}_{(1)},\mathbf {x}_{(2)},\ldots,\mathbf{x}_{(k)},\ldots, \mathbf{x}_{(K_{L})}\}\)

3. For k=1 to K_{L}
Once the new vertex v_{y} is connected to the K-associated graph, its class label is estimated using the Bayes theory [15]. The connections established during classification are temporary, i.e., they are not incorporated into the graph structure. The posterior probability that a new vertex v_{y} belongs to component C_{α}, given the set of label-independent neighbors of v_{y}, denoted by \(\varLambda_{v_{y}}\), is defined by Eq. (4),
$$ P(v_y \in C_{\alpha}\mid \varLambda_{v_y}) =\frac{P(\varLambda_{v_y}\mid v_y \in C_{\alpha})P(v_y \in C_{\alpha })}{P(\varLambda_{v_y})}$$
(4)
Since each component C_{α} was formed in a particular K-associated graph among the various graphs generated, we must consider the particular value of K with which C_{α} was formed, denoted by K_{α}. Let \(\varLambda_{v_{y},K_{\alpha}}\) represent the set of K_{α} nearest neighbors of v_{y}. Thus, to estimate the probability \(P(\varLambda_{v_{y}}\mid v_{y} \in C_{\alpha })\), one must consider the fraction of the connections made with component C_{α} over all K_{α} possible connections, as shown in Eq. (5),
$$ P(\varLambda_{v_y}\mid v_y \in C_{\alpha}) =\frac{|\varLambda_{v_y,K_{\alpha}} \cap C_{\alpha}|}{K_{\alpha}}$$
(5)
The prior probabilities P(v_{y}∈C_{α}) are defined as the normalized purities of the components to which v_{y} is connected, \(P(v_{y} \in C_{\alpha}) = \varPhi_{\alpha} / \sum_{N_{v_{y},C_{\beta}} \neq 0} \varPhi_{\beta}\), where \(N_{v_{y},C_{\beta}}\) represents the number of connections v_{y} has to component C_{β}. Accordingly, the normalizing term is given by Eq. (6),
$$ P(\varLambda_{v_y}) = \sum _{N_{v_y,C_{\beta}} \neq0} P(\varLambda_{v_y}\mid v_y \in C_{\beta})P(v_y \in C_{\beta})$$
(6)
In many cases, there are more components than classes. According to the Bayes optimal classifier, it is then necessary to sum the posterior probabilities of all components corresponding to the same class. Finally, the largest of the resulting probabilities indicates the most probable class for the new pattern, according to Eq. (7), where φ(y) stands for the class attributed to instance y,
$$ \varphi(\mathbf{y} ) = \mathop{\mathrm{arg\,max}} \bigl\{ P (\mathbf{y}\mid \omega_1 ),\dots, P (\mathbf{y}\mid \omega_M )\bigr\}$$
(7)
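The computation in Eqs. (4) to (7) can be illustrated with the sketch below. The rule applied inside the loop of step 3 is not spelled out in the list above, so the rule used here (the k-th nearest neighbor is linked only when k does not exceed the K_{α} of its component) is an assumption, as are the input format and the function name. Purities are assumed to be positive.

```python
from collections import defaultdict

def classify(neighbor_components, components):
    """Estimate the class of a new vertex from its K_L nearest neighbors.

    neighbor_components: component id of each neighbor, nearest first.
    components: id -> (class_label, purity, k_alpha).
    """
    # Assumed connection rule: the k-th neighbor is linked only if k <= K_alpha.
    links = defaultdict(int)
    for rank, comp in enumerate(neighbor_components, start=1):
        if rank <= components[comp][2]:
            links[comp] += 1

    total_purity = sum(components[c][1] for c in links)
    joint = {}
    for comp, n_links in links.items():
        _, phi, k_alpha = components[comp]
        likelihood = n_links / k_alpha    # Eq. (5)
        prior = phi / total_purity        # normalized purity
        joint[comp] = likelihood * prior  # numerator of Eq. (4)
    evidence = sum(joint.values())        # Eq. (6)

    # Sum the posteriors of the components that share a class, then take the largest.
    by_class = defaultdict(float)
    for comp, value in joint.items():
        by_class[components[comp][0]] += value / evidence
    return max(by_class, key=by_class.get), dict(by_class)

components = {"A": ("w1", 1.0, 3), "B": ("w2", 0.5, 1), "C": ("w1", 0.8, 2)}
label, posterior = classify(["B", "A", "C"], components)
print(label, posterior)
```

In this toy case the third neighbor is ignored because its component was formed with K=2, and the nearest component wins with a posterior of about 0.6 despite its lower purity.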
Classifying partially labeled data stream
This section describes how the proposed graph-based structure copes with nonstationary classification. Consider a stream S={X_{1},Y_{1},…,X_{T},Y_{T}}, where X_{t}={(x_{1},c_{1}),…,(x_{l},c_{l}),x_{l+1},…,x_{N}} contains labeled and unlabeled patterns, while Y_{t}={y_{1},…,y_{M}} contains unlabeled patterns only. Such streams may present concept drift at any time. Therefore, an online classifier should be able to evolve by adding new knowledge over time without being retrained. In the proposed approach, this evolution is achieved with a dynamic graph, named the principal graph, which grows with the frequent addition of components provided by the K-associated optimal graph formation (Algorithm 1) as the data stream is processed. Algorithm 2 details the proposed approach.
Algorithm 2 presents the KAOGINCSSL algorithm, which processes a data stream S composed of partially labeled and unlabeled data sets. The function nextChunk(S) removes the next set from stream S and puts it into the variable Z, which represents a chunk of data. The algorithm then determines whether the set is partially labeled, and can thus be used for training/updating (i.e., whether it has enough labeled patterns, e.g., at least 5 %), through the function isPartiallyLabeled(Z), which returns “true” if Z is partially labeled and “false” otherwise.
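The chunk-routing logic described above can be sketched as follows. The representation of a chunk as a list of (pattern, label) pairs with `None` for missing labels, the `train` and `classify` callbacks, and the returned log are assumptions of this example; the 5 % threshold is the example value mentioned in the text.

```python
from collections import deque

def is_partially_labeled(chunk, min_fraction=0.05):
    """True if at least min_fraction of the (pattern, label) pairs carry a label."""
    if not chunk:
        return False
    labeled = sum(1 for _, label in chunk if label is not None)
    return labeled / len(chunk) >= min_fraction

def process_stream(stream, train, classify):
    """Route each chunk to training or classification until the stream is empty."""
    stream = deque(stream)
    log = []
    while stream:
        chunk = stream.popleft()  # plays the role of nextChunk(S)
        if is_partially_labeled(chunk):
            train(chunk)
            log.append("train")
        else:
            classify(chunk)
            log.append("classify")
    return log

# One partially labeled chunk (1 label out of 10) followed by an unlabeled one.
x_chunk = [((0.0,), "a")] + [((float(i),), None) for i in range(1, 10)]
y_chunk = [((float(i),), None) for i in range(5)]
print(process_stream([x_chunk, y_chunk], train=lambda c: None, classify=lambda c: None))
# ['train', 'classify']
```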
The tasks of the algorithm are therefore twofold: (i) incorporate new knowledge from both labeled and unlabeled patterns to cope with concept drift, and (ii) predict the labels of the unlabeled patterns in unlabeled sets. In the former task, the objective is to incorporate new knowledge from the most recently obtained partially labeled set; thus, a semisupervised K-associated optimal graph is derived using Algorithm 1 (KAOGSS). As explained in Sect. 3.2, the KAOGSS algorithm generates the K-associated optimal graph while spreading the labels to all vertices, and the resulting graph is composed of several disjoint components. These new components are then merged into the principal graph (G_{P}), which is composed of independent components. However, the addition of new components increases the size of the principal graph, which may increase classification error and time. To avoid this problem, the principal graph should not grow without limit; thus, old and unused components should be removed.
The task of classifying new patterns takes place if the set at hand is unlabeled, and it is carried out by simply applying the KAOG classifier with the principal graph to classify the unlabeled vertices, as presented in Sect. 3.3. Component removal takes place during the classification phase by applying a method named the disuse rule. This rule establishes a maximum number of consecutive classifications in which a component is allowed to be unused (i.e., to receive no connections during classification). This maximum is set by the parameter τ. When a component remains unused while τ consecutive patterns are classified, it is removed from the principal graph. The algorithm finishes when the whole stream has been processed, i.e., S=∅.
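The disuse rule can be sketched with one idle counter per component. The dictionary-based bookkeeping and the function name are assumptions of this example, as is the choice to remove a component exactly when its counter reaches τ.

```python
def apply_disuse_rule(idle, used, tau):
    """Update the idle counters after one classification and drop stale components.

    idle: dict mapping component id -> consecutive classifications without a connection.
    used: set of component ids that received a connection from the pattern just classified.
    Returns the ids removed in this step; idle is modified in place.
    """
    removed = []
    for comp in list(idle):
        idle[comp] = 0 if comp in used else idle[comp] + 1
        if idle[comp] >= tau:
            removed.append(comp)
            del idle[comp]
    return removed

idle = {"A": 0, "B": 0}
print(apply_disuse_rule(idle, used={"A"}, tau=2))  # []
print(apply_disuse_rule(idle, used={"A"}, tau=2))  # ['B']
print(idle)                                        # {'A': 0}
```

Each call visits every component of the principal graph once, which agrees with a removal cost linear in the number of components.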
An important feature of stream classification algorithms is their ability to process data in a reasonable time, which includes the tasks of training, updating and classifying. The proposed algorithm consists of the following data processing phases: (i) training or updating the principal graph, (ii) classifying new data, and (iii) removing unused components.
In the first phase, training or updating the principal graph is required whenever a partially labeled set X is presented. Let there be N instances in the set X; training (or updating) corresponds to building a semisupervised K-associated optimal graph (Algorithm 1). As estimated in Ref. [4], the complexity of building a supervised K-associated optimal graph is about O(N^{2}), due to the distance matrix calculation. It has also been shown that the K-associated optimal graph construction scales better than the C4.5 and the Gibbs Sampling algorithms. The only additional processing in the semisupervised version is to verify whether a component contains more than one class and, if so, to cut some edges to divide it into single-class components. Finding and cutting a component with the proposed technique depends on the number of vertices and edges in the component, O(N_{α}+E_{α}), where N_{α} and E_{α} are the number of vertices and edges in the component C_{α}, respectively. Since K-associated graphs are sparse, N_{α} and E_{α} are much smaller than the number of vertices in the whole graph. Given also that few components need to be partitioned (only those composed of vertices from more than one class), the computational order of this phase remains O(N^{2}).
In the second phase, the cost of classifying a new pattern has also been estimated in the aforementioned work as O(N_{p}), due to the distance calculation between the new vertex and the N_{p} vertices in the principal graph. It is worth mentioning that there are strategies for lowering this order, for example, locating the nearest components first instead of searching directly for the nearest vertices. Such a strategy decreases the computational cost to O(N_{cp}), with N_{cp} being the number of components in the principal graph and N_{cp}≪N_{p}. Finally, component removal is performed by the disuse rule, which simply checks the time parameter of each component and therefore has order O(N_{cp}).