An adaptive kernel width convex combination method for maximum correntropy criterion

Recently, the maximum correntropy criterion (MCC) has been successfully applied in numerous applications regarding non-Gaussian data processing. MCC employs a free parameter called kernel width, which affects the convergence rate, robustness, and steady-state performance of adaptive filtering. However, determining the optimal value for this parameter is not always a trivial task. Within this context, this paper proposes a novel method called adaptive convex combination maximum correntropy criterion (ACCMCC), which combines an adaptive kernel algorithm with convex combination techniques. ACCMCC takes advantage of a convex combination of two adaptive MCC-based filters, whose kernel widths are adjusted iteratively as a function of the minimum error value obtained in a predefined estimation window. Results obtained in impulsive noise environments show that the proposed approach achieves equivalent convergence rates but with increased accuracy and robustness when compared with similar algorithms reported in the literature.


Introduction
Correntropy is a similarity measure based on Rényi entropy capable of extracting high-order statistical information from data, thus generalizing the concept of correlation [1]. In the past few years, this concept has been successfully applied in the solution of various engineering problems, such as kernel Kalman filtering [2], adaptive time delay estimation [3], automatic modulation classification [4,5], nonlinear auto-regressive with exogenous inputs (NARX) system identification [6], compressive sensing in impulsive noise environments [7], signal detection in MIMO systems [8], and fuzzy neural systems [9]. Training adaptive systems with cost functions such as the error correntropy criterion requires the selection of a proper kernel width. This parameter essentially controls the nature of the performance surface over which the system parameters are adapted. Therefore, it has important effects on distinct aspects, e.g., convergence speed, presence of local optima, and stability of weight tracks [10]. The maximum correntropy criterion (MCC) uses correntropy as a cost function in optimization problems such as adaptive filtering. It has been shown in the literature that MCC is able to achieve better performance than second-order methods [11-15]. Clearly, the use of a fixed value for the kernel width to estimate the error correntropy throughout the adaptation process is suboptimal. Nevertheless, several works fix this parameter, imposing trade-offs among convergence rate, robustness, and steady-state accuracy [10].
A series of adaptive kernel width algorithms has been proposed in order to choose a proper value for this parameter iteratively in optimization problems. An algorithm called adaptive kernel width MCC (AMCC) is proposed in [16], aiming to improve the learning speed, especially when the initial weight vector is far from the optimal one. Another method, called switch kernel width MCC (SMCC) [17], updates the kernel width at each iteration based on the instantaneous error between the estimate and the desired signal. Recently, it has been shown that the technique developed in [18], called variable kernel width MCC (VKW-MCC), is able to search for the best kernel width at each iteration, implying reduced error. This strategy provides a fast convergence rate and stable steady-state performance. On the other hand, the work presented in [19] reports a robust adaptive algorithm based on the convex combination of maximum correntropy (CMCC), which combines two MCC filters employing distinct step sizes and a fixed kernel width to satisfy the conflicting requirements between convergence rate and steady-state mean square error.
Within this context, this work proposes an algorithm called adaptive convex combination kernel width method for maximum correntropy criterion (ACCMCC), which combines two MCC-based adaptive algorithms with different step sizes while also adjusting the kernel size. The introduced approach is based on the gradient method, as two learning rates are used and then combined in order to keep a fast convergence rate and a small error. Simulation results are presented to validate the performance of the ACCMCC method compared with similar algorithms previously reported in the literature, e.g., MCC, VKW-MCC, SMCC, AMCC, and CMCC.
The remainder of the paper is organized as follows. The "Correntropy" section reviews the correntropy function concepts. The "Adaptive kernel algorithms" section describes in detail each kernel adaptive algorithm evaluated in this work. The "Adaptive convex combination kernel width MCC (ACCMCC)" section proposes an algorithm called adaptive convex combination kernel width for maximum correntropy criterion. The "Impulsive noise" section presents the noise model used in the simulations, while the "Results and discussion" section compares the performance of the proposed method with other similar solutions. Finally, the "Conclusion" section discusses the main contributions and results of the study carried out in this work.

Correntropy
Correntropy is a similarity measure between two arbitrary signals or random variables. Given two random variables X and Y, correntropy is defined by [1]

V(X, Y) = E[k_σ(X, Y)],    (1)

where E[·] denotes the expectation operator and k_σ(·, ·) is any positive-definite kernel with bandwidth σ. In this work, the kernel function is the Gaussian kernel

G_σ(x − y) = exp(−(x − y)²/(2σ²)),    (2)

where σ corresponds to the kernel bandwidth, also referred to as the kernel size or kernel width. Using the Taylor series expansion of the Gaussian function in (1), it is possible to obtain

V(X, Y) = Σ_{n=0}^{∞} [(−1)ⁿ/(2ⁿ n!)] E[(X − Y)²ⁿ/σ²ⁿ].    (3)

By using the Gaussian kernel, the correntropy function becomes a weighted sum of all the even moments of the random variable (X − Y). It is interesting to observe that the kernel size in Eq. (3) appears as a weighting parameter that controls which moments are effectively used. For sufficiently large values of σ, the second-order moments are dominant and the measure gets closer to correlation [20].
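As a concrete illustration, the sample estimator of correntropy under a Gaussian kernel can be sketched in a few lines (the kernel's normalization constant is dropped here, since it does not affect maximization):

```python
import numpy as np

def correntropy(x, y, sigma):
    """Sample-mean estimate of correntropy between two signals
    using an (unnormalized) Gaussian kernel of width sigma."""
    e = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return np.mean(np.exp(-e**2 / (2.0 * sigma**2)))
```

Identical signals yield the maximum value of 1, while a very large kernel width makes the measure insensitive to differences, consistent with the dominance of the second-order moments noted above.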

Maximum correntropy criterion (MCC)
Due to the well-known ability of correntropy to deal with outliers, it is often used as a cost function in regression problems [1]. The goal is to maximize the similarity between a given desired signal, d_i, and an estimated output, y_i = wᵀx_i, where w is the estimated weight vector and x_i is the input vector. In many cases, only a finite number of samples is available, so a sample-mean estimator of correntropy can be employed [20]:

V̂(d, y) = (1/N) Σ_{i=1}^{N} G_σ(d_i − y_i).    (4)

By using the Gaussian kernel, a cost function J can be expressed as

J = (1/N) Σ_{i=1}^{N} exp(−(d_i − y_i)²/(2σ²)).    (5)

Then, by maximizing (5) with the stochastic gradient ascent method, it is possible to obtain an update rule for the weights w in the form

w_{k+1} = w_k + μ exp(−e_k²/(2σ²)) e_k x_k,    (6)

where e_k = d_k − y_k = d_k − wᵀx_k is the error at the kth algorithm iteration and μ is the step size (learning rate) parameter of the stochastic gradient ascent method.
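A minimal sketch of the resulting MCC adaptive filter applied to a toy system identification task; the two-tap system, step size, and kernel width below are illustrative choices, not the paper's settings:

```python
import numpy as np

def mcc_update(w, x, d, mu, sigma):
    # w_{k+1} = w_k + mu * exp(-e_k^2 / (2 sigma^2)) * e_k * x_k
    e = d - w @ x
    return w + mu * np.exp(-e**2 / (2.0 * sigma**2)) * e * x

# Toy example: identify a 2-tap system from noiseless samples.
rng = np.random.default_rng(0)
w_true = np.array([0.5, -0.3])   # hypothetical system
w = np.zeros(2)
for _ in range(2000):
    x = rng.standard_normal(2)
    w = mcc_update(w, x, d=w_true @ x, mu=0.05, sigma=1.0)
```

The Gaussian factor exp(−e²/(2σ²)) shrinks the effective step size when the error is large, which is precisely what makes MCC robust to impulsive outliers.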

Adaptive kernel algorithms
In most cases, replacing the mean square error with MCC as a cost function makes the algorithm robust to outliers. On the other hand, another parameter is introduced, i.e., the kernel width, which is an inherent characteristic of correntropy. The kernel width works as a scale parameter that controls the width of the Gaussian kernel used by correntropy, being directly associated with the steady-state performance, convergence rate, and impulsive noise rejection. Since it is a free parameter, the kernel width must be chosen by the user, and its proper value changes according to the nature of the data and application. Hence, the definition of an optimal value for the kernel width is not an easy task [20]. In this context, many works have been proposed in order to help determine the optimal kernel width, e.g., [16-18, 21]. This work introduces a novel algorithm to address this issue, and the forthcoming sections are dedicated to a detailed discussion of the methods that led to the conception of the proposed approach.

Adaptive kernel width MCC (AMCC)
According to [16], AMCC consists in selecting the kernel width as a combination of a fixed kernel bandwidth σ, which can be defined using the Silverman rule [22], and the prediction error e_k, i.e.,

σ_k = σ + |e_k|.    (7)

The authors in [16] showed that this approach makes the algorithm converge faster, especially when the initial weight vector is far away from the optimal one. Besides the fast convergence rate, prominent advantages of the method lie in its simplicity: no extra computational cost and no extra free parameters are required.
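Reading the rule in [16] as an additive combination of the fixed bandwidth and the error magnitude (an assumption on our part; the exact expression is in the original reference), the kernel update can be sketched as:

```python
import numpy as np

def amcc_kernel_width(sigma0, e_k):
    # Assumed additive form: fixed bandwidth (e.g., via Silverman's
    # rule) plus the magnitude of the instantaneous prediction error.
    return sigma0 + np.abs(e_k)
```

A large error temporarily widens the kernel, flattening the performance surface and speeding up convergence when the weights are far from optimal.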

Switch kernel width MCC (SMCC)
In order to improve the convergence rate of the method, the switch kernel width MCC (SMCC) algorithm introduced in [17] selects, at each iteration, the maximum between half the absolute value of the instantaneous error and a fixed kernel width, i.e.,

σ_k = max(|e_k|/2, σ),    (8)

where σ is a predetermined value which can be obtained using the Silverman rule [22] or a similar method. This is another example of a simple update rule for the kernel width that does not add new free parameters to the traditional MCC algorithm while maintaining robustness.
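The SMCC switching rule described above is straightforward to implement:

```python
import numpy as np

def smcc_kernel_width(sigma, e_k):
    # Switch between the fixed width and half the instantaneous
    # absolute error, keeping whichever is larger.
    return max(sigma, np.abs(e_k) / 2.0)
```

Small errors leave the fixed width in charge, while a large (possibly impulsive) error widens the kernel and attenuates its influence on the weight update.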

Variable kernel width MCC (VKW-MCC)
The VKW-MCC algorithm calculates the kernel size at each iteration by maximizing exp(−e²/(2σ²)) with respect to the kernel width σ [18]. For this purpose, the authors employ a modified cost function to reduce the interference of the kernel size on the weight update. Using the gradient ascent approach, the modified MCC algorithm is given as

w_{k+1} = w_k + μ exp(−e_k²/(2σ_k²)) e_k x_k.    (10)

Assuming that the noise is not impulsive, the work developed in [18] has also shown that the optimal kernel size at the kth iteration is given by

σ_k = k_σ |e_k|,    (11)

where k_σ is a positive constant. In order to ensure a robust response to impulsive noise [23], the VKW-MCC method computes E[|e_k|] instead of |e_k| in Eq. (11), i.e.,

E[|e_k|] = τ E[|e_{k−1}|] + (1 − τ) min(A_{e,k}),    (12)

where τ is a smoothing factor that can assume any value between 0 and 1 and A_{e,k} is the set of the last N_w absolute error values, i.e.,

A_{e,k} = {|e_{k−N_w+1}|, …, |e_k|},    (13)

with N_w being the length of the estimation window. Then, Eq. (11) can be rewritten as

σ_k = k_σ E[|e_k|].    (14)

The authors in [18] also mention that the VKW-MCC algorithm lacks robustness when σ_k is too large. To prevent this from happening, the kernel size must be kept within a predefined interval [σ_min, σ_max].
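A sketch of one VKW-MCC kernel width update, following the windowed scheme described above; the bounds sigma_min and sigma_max are illustrative values standing in for the interval mentioned in [18]:

```python
import numpy as np

def vkw_kernel_width(abs_errors, e_avg_prev, k_sigma, tau,
                     sigma_min=0.5, sigma_max=10.0):
    """abs_errors: the last N_w absolute errors (the set A_{e,k});
    e_avg_prev: previous smoothed estimate of E[|e_k|]."""
    # Taking the *minimum* absolute error over the window keeps the
    # estimate robust to isolated impulsive outliers.
    e_avg = tau * e_avg_prev + (1.0 - tau) * min(abs_errors)
    sigma_k = float(np.clip(k_sigma * e_avg, sigma_min, sigma_max))
    return sigma_k, e_avg

sigma_k, e_avg = vkw_kernel_width([1.0, 2.0, 3.0], e_avg_prev=2.0,
                                  k_sigma=2.0, tau=0.5)
```

Even if one entry of the window is an impulse, the minimum over the window stays small, so the kernel width does not blow up.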

Adaptive convex combination kernel width MCC (ACCMCC)
Recently, the authors of [19] defined a robust adaptive algorithm by convexly combining two MCC-based adaptive algorithms of the form of Eq. (6) with different step-size parameters μ, a larger and a smaller one. The two adaptive filters are then combined based on the error. However, this approach does not use any adaptive kernel technique to update the kernel size used by correntropy, which is another free parameter that could be optimized.
Given the recognized importance of properly defining the kernel size parameter in correntropy-based algorithms, this work proposes the combination of the kernel size optimization strategy from [18] with the CMCC algorithm described in [19] in order to obtain a new algorithm with improved performance, which we define as adaptive convex combination kernel width MCC (ACCMCC). This novel method is based on the convex combination of two adaptive filters with distinct convergence rates. This is achieved by setting a larger step μ₁ for the faster filter and a smaller step μ₂ for the slower one. The parameter vectors w_{j,k}, j = 1, 2, are then updated according to the following rule:

w_{j,k+1} = w_{j,k} + μ_j exp(−e_{j,k}²/(2σ_{j,k}²)) e_{j,k} x_k,    (15)

where e_{j,k} = d_k − w_{j,k}ᵀ x_k. The outputs of the adaptive filters employed in ACCMCC are combined with a so-called mixing parameter λ_k to obtain the overall output as

y_k = λ_k y_{1,k} + (1 − λ_k) y_{2,k},    (16)

where y_{1,k} = w_{1,k}ᵀ x_k, y_{2,k} = w_{2,k}ᵀ x_k, and w_{1,k} and w_{2,k} are the weight vectors of the filters with large and small step sizes, respectively. Once the overall output is calculated, the overall filter output error can be obtained as

e_k = d_k − y_k.    (17)

Analogously to (16), the overall filter weight vector can be determined by combining w_{1,k} and w_{2,k} with the mixing parameter, resulting in

w_k = λ_k w_{1,k} + (1 − λ_k) w_{2,k}.    (18)

The desired behavior for the mixing parameter λ_k in (18) lies in keeping it as close to 1 as possible when the algorithm starts, while it should tend to 0 as ACCMCC begins to converge. According to [19], this behavior can be obtained by using a sigmoidal function to update λ_k, i.e.,

λ_k = 1/(1 + exp(−α_k)).    (19)

Expression (19) shows that α_k controls the update rate of λ_k. In order to maintain robustness to impulsive noise, the adaptation of α_k is expressed by an adaptive rule, which is obtained from correntropy maximization with the stochastic gradient ascent method:

α_{k+1} = α_k + μ_α exp(−e_k²/(2σ²)) e_k η_k,    (20)

where μ_α is the adaptation step and η_k corresponds to a simplified mathematical notation in the form

η_k = λ_k (1 − λ_k) (y_{1,k} − y_{2,k}).    (21)

Similarly to the former CMCC method, the range of α_k is restricted to the symmetric interval [−4, 4].
The proposed ACCMCC is summarized as Algorithm 1, where it is possible to notice that the VKW-MCC mechanism is adopted to adjust the kernel width for each one of the adaptive filters. Therefore, it is possible to state that ACCMCC is a hybrid approach derived from the combination of CMCC and VKW-MCC.
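One ACCMCC iteration can be sketched as follows; for brevity, the kernel widths s1 and s2 are passed in as arguments, whereas in the full algorithm each would be adapted by the VKW-MCC rule, and the exact form of the mixing-parameter gradient is our reconstruction of the CMCC update:

```python
import numpy as np

def accmcc_step(w1, w2, alpha, x, d, mu1, mu2, mu_a, s1, s2, sigma):
    y1, y2 = w1 @ x, w2 @ x
    lam = 1.0 / (1.0 + np.exp(-alpha))       # sigmoidal mixing
    y = lam * y1 + (1.0 - lam) * y2          # overall output
    e = d - y                                # overall error
    # Component-wise MCC updates with distinct steps and widths.
    e1, e2 = d - y1, d - y2
    w1 = w1 + mu1 * np.exp(-e1**2 / (2 * s1**2)) * e1 * x
    w2 = w2 + mu2 * np.exp(-e2**2 / (2 * s2**2)) * e2 * x
    # Mixing-parameter adaptation by correntropy gradient ascent,
    # clipped to [-4, 4] as in CMCC.
    alpha = alpha + mu_a * np.exp(-e**2 / (2 * sigma**2)) \
        * e * (y1 - y2) * lam * (1.0 - lam)
    alpha = float(np.clip(alpha, -4.0, 4.0))
    return w1, w2, alpha
```

When both filters already match the system, the error vanishes and the step leaves all quantities unchanged, as expected at a stationary point.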

Impulsive noise
Several statistical models have been proposed to describe impulsive noise. One of the most popular statistical distributions for this purpose is the α-stable distribution [24]. The fundamental principles associated with alpha-stable modeling are based on the generalized central limit theorem [25]. By adjusting its free parameters, it is possible to generate various probability distributions, such as the Gaussian, Cauchy-Lorentz, or Lévy distributions.
In traditional signal processing, Gaussian noise is commonly used in system modeling due to its simple probability density function (PDF). With the rapid development of technology, it is nearly impossible to ignore the impact of non-Gaussian noise, thus leading to the adoption of a generalized distribution. It is reported in the literature as the alpha-stable distribution, which satisfies the generalized central limit theorem. Due to the lack of closed-form expressions for its density and distribution functions, the alpha-stable distribution is commonly expressed in terms of a characteristic function [25]. Let us consider that the characteristic function of a random variable X has the form described in [26] as

φ(ω) = exp{ jδω − γ|ω|^α [1 + jβ sgn(ω) S(ω, α)] },    (22)

where S(ω, α) = tan(πα/2) for α ≠ 1, S(ω, α) = (2/π) log|ω| for α = 1, sgn(·) is the sign function, and δ is the location parameter.

It can be seen that α in (22) is the characteristic exponent (0 < α ≤ 2), which mainly characterizes the intensity of the impulsive noise. This parameter controls the impulsiveness of the density function, i.e., the smaller the value of α, the heavier the distribution tail; in particular, α = 2 yields the Gaussian distribution. The parameter β determines the sign and degree of asymmetry; if β = 0, the distribution is symmetric. Finally, γ is the dispersion parameter, which plays a role similar to the variance σ² of a Gaussian distribution, measuring the spread of the sample data.
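For reference, symmetric (β = 0) alpha-stable noise can be generated with the Chambers-Mallows-Stuck method; this is a standard construction, not code from the paper, and the parameter values below are illustrative:

```python
import numpy as np

def symmetric_alpha_stable(alpha, gamma, size, seed=None):
    """Draw symmetric alpha-stable samples; gamma is the dispersion,
    so the scale parameter is gamma**(1/alpha)."""
    rng = np.random.default_rng(seed)
    v = rng.uniform(-np.pi / 2, np.pi / 2, size)
    w = rng.exponential(1.0, size)
    x = (np.sin(alpha * v) / np.cos(v) ** (1.0 / alpha)
         * (np.cos(v - alpha * v) / w) ** ((1.0 - alpha) / alpha))
    return gamma ** (1.0 / alpha) * x

# alpha = 2 recovers Gaussian noise; smaller alpha produces heavier
# tails, i.e., more frequent large impulses.
noise = symmetric_alpha_stable(alpha=1.3, gamma=0.5, size=10000, seed=0)
```

For α = 1 the expression reduces to tan(v), i.e., Cauchy samples, and for α = 2 it reduces to a scaled Gaussian, matching the special cases mentioned above.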

Results and discussion
In this section, Monte Carlo simulations are carried out to verify the theoretical analysis and validate the behavior of the ACCMCC algorithm. The novel method is used in a system identification problem, while its performance is compared with those of the traditional MCC with fixed kernel width and four other methods reported in the literature, i.e., AMCC, SMCC, CMCC, and VKW-MCC, which use adaptive kernel widths. The system adopted in the simulation tests has nine taps and its vector of weight parameters is defined as in [27]. The parameter settings are listed in Table 1. Two different non-Gaussian noise signals are individually used to contaminate the signal of the system identified by the MCC-based algorithms, each one corresponding to a distinct scenario. The performance of ACCMCC is compared with those of the other methods using the weighted signal-to-noise ratio (WSNR) metric, which is expressed as [13]

WSNR_k = 10 log₁₀[ (w*ᵀ R w*) / ((w* − w_k)ᵀ R (w* − w_k)) ],

where w* is the true weight vector and R = E[x_k x_kᵀ] is the autocorrelation matrix of the input. It is worth mentioning that WSNR convergence curves are commonly used to compare the convergence performance of distinct algorithms.
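Assuming the usual definition of WSNR (the energy of the true filter response relative to the energy of the weight-error response, both weighted by the input autocorrelation matrix R), the metric can be computed as:

```python
import numpy as np

def wsnr_db(w_true, w_est, R):
    """WSNR in dB between true and estimated weight vectors, with R
    the autocorrelation matrix of the input signal."""
    dw = w_true - w_est
    return 10.0 * np.log10((w_true @ R @ w_true) / (dw @ R @ dw))
```

Higher WSNR means the estimated filter is closer to the true one; the convergence curves discussed below plot this quantity against the iteration index.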

Scenario 1-System under bimodal noise
In the first scenario, the system is subject to a non-Gaussian bimodal noise distribution [1]. Figure 1 presents a comparison among several methods under this condition. After the 500th sample, ACCMCC achieves improved performance compared to the other methods. In steady state, ACCMCC becomes stable at WSNR = 33 dB. It can also be stated that the proposed method presents a good convergence rate. Figure 2 shows the values assumed by the adaptive kernel widths of both adaptive filters used in ACCMCC. The kernel widths are initially large, favoring the search in a more convex space, i.e., one where only a few local minima exist. When the algorithm is close to the solution, the kernel size is adjusted to find a more accurate solution. In particular, ACCMCC comprises a combination of two MCC adaptive filters with distinct kernel widths and step sizes.

Scenario 2-System under alpha-stable noise
The alpha-stable noise defined by (22) is used in the second scenario to contaminate the system under study. Tests using different noise intensities are performed to evaluate the behavior of ACCMCC. In this case, the generalized signal-to-noise ratio (GSNR) is employed, defined as [13]

GSNR = 10 log₁₀[ (1/(γN)) Σ_{n=1}^{N} |s(n)|² ],

where s is the original signal and γ is the dispersion parameter of the alpha-stable distribution. Different values of GSNR are used in the simulation tests, ranging from 12 to 20 dB as shown in Fig. 3. All MCC-based algorithms present reduced efficiency as GSNR decreases. However, ACCMCC and CMCC are the only two methods with good performance over the entire GSNR range. The specifications of the alpha-stable parameters are listed in Table 2.
The performance of ACCMCC is compared with those of the other MCC-based methods while varying GSNR from 12 to 20 dB, according to Fig. 3. In particular, the method proposed in this work achieves the best results for all values assumed by GSNR. Figure 4 represents the performance of ACCMCC when the system is under alpha-stable noise with α = 1.3 and GSNR = 15 dB. It is possible to notice that ACCMCC becomes stable at about WSNR = 25 dB after the 500th iteration, which is the best result achieved among all analyzed methods. It is worth mentioning that the performances of AMCC and SMCC improve significantly relative to ACCMCC as the noise becomes Gaussian, i.e., as α tends to 2. According to Fig. 5, the proposed algorithm still outperforms MCC, CMCC, AMCC, SMCC, and VKW-MCC in terms of convergence rate, steady-state misalignment, and tracking performance in non-Gaussian environments.
According to the previous analysis, the results demonstrate that the proposed method presents better accuracy and robustness in both noisy scenarios.

Conclusion
This work has presented a novel algorithm called adaptive convex combination maximum correntropy criterion, which is based on the VKW-MCC and CMCC methods. Since the kernel is used to calculate the value assumed by the cost function, modifying the kernel width implies changing the estimator of the cost function. The proposed mechanism to adapt the kernel size can thus also be seen as a way of using different cost functions depending on the nature of the data seen by the adaptive system. In order to provide robustness against impulsive interference, the adaptive rule of the mixing factor is derived by maximizing correntropy with the stochastic gradient ascent method with two distinct learning rates. This approach has presented improved performance compared with other existing methods, especially in the presence of highly impulsive noise. Simulation results have also shown that, even with new free parameters included in the method, the proposed algorithm is able to effectively eliminate interference caused by impulsive noise while also improving the estimation accuracy.