Pre-processing for noise detection in gene expression classification data

Due to the imprecise nature of biological experiments, biological data is often characterized by the presence of redundant and noisy data. This may be due to errors that occurred during data collection, such as contamination of laboratory samples. This is the case for gene expression data, where the equipment and tools currently used frequently produce noisy measurements. Machine Learning algorithms have been successfully used in gene expression data analysis. Although many Machine Learning algorithms can deal with noise, detecting and removing noisy instances from the training data set can help the induction of the target hypothesis. This paper evaluates the use of distance-based pre-processing techniques for noise detection in gene expression data classification problems. The evaluation analyzes the effectiveness of the investigated techniques in removing noisy data, measured by the accuracy obtained by different Machine Learning classifiers over the pre-processed data.


Introduction
Due to the imprecise nature of biological experiments, biological data is often characterized by the presence of redundant and noisy examples. This kind of data may originate, for example, from errors during data collection, such as contamination of laboratory samples. Gene expression data are examples of biological data that suffer from this problem. Although many Machine Learning (ML) algorithms can deal with noise, detecting and removing noisy instances from the training data set can help the induction of the target hypothesis.
Noise can be defined as an example apparently inconsistent with the remaining examples in a data set. The presence of noise in a data set can decrease the predictive performance of ML algorithms and increase both the model complexity and the time necessary for its induction. Data sets with noisy instances are common in real world problems, where the data collection process can produce noisy data.
Data are usually collected from measurements related to a given domain. This process may result in several problems, such as measurement errors and incomplete, corrupted, wrong or distorted examples. Therefore, noise detection is a critical issue, especially in domains demanding security and reliability. The presence of noise can lead to situations that degrade the system performance or the security and trustworthiness of the information involved. A wide variety of noise detection applications can be found in different domains, such as fraud detection, loan application processing, intrusion detection, analysis of network performance and bottlenecks, detection of novelties in images, pharmaceutical research, and others [17].
Different types of noise can be found in data sets, especially in those representing real problems (see Figure 1). In order to illustrate these different types, the instances of a given data set can be divided into five groups, including the following. Mislabeled cases: instances incorrectly classified in the data set generation process. These cases are noisy instances. Redundant data: instances that form clusters in the data set and can be represented by others. At least one of these patterns should be maintained so that the representativeness of the cluster is preserved. Outliers: instances too distinct when compared to the other examples of the data set. These instances can be either noisy or very particular cases, and their influence on the hypothesis induction should be minimized.
Gene expression data are, in general, represented by complex, high dimensional data sets, which are susceptible to noise. In fact, real world biological data sets, gene expression data sets among them, contain a large number of noisy cases.
When using gene expression data sets, some aspects may influence the performance achieved by ML algorithms. Due to the imprecise nature of biological experiments, redundant and noisy examples can be found at a high rate. Noisy patterns can corrupt the generated classifier and should therefore be removed [21]. Redundant and similar examples can be eliminated without harming the concept induction, and their removal may even improve it.
In order to deal with noisy data, several approaches and algorithms for noise detection can be found in the literature. This paper focuses on the investigation of distance-based noise detection techniques, adopted in a pre-processing phase. This phase aims to identify possible noisy examples and remove them. In this work, three ML algorithms are trained with the original data sets and with different sets of pre-processed data produced by the application of noise detection techniques. By evaluating the difference in performance between classifiers generated over the original data (without pre-processing) and over the pre-processed data, the effectiveness of distance-based techniques in recognizing noisy cases can be estimated.
There are other works [18, 24] that look for noise in gene expression data sets but, unlike this work, the experiments reported in those papers eliminate only genes. In the experiments performed here, we use noise detection techniques mainly to detect mislabeled tissues.
Details of the noise detection techniques used are presented in Section 2. The methodology employed in the experiments, the data sets used and the ML algorithms adopted are described in Section 3. The results obtained are presented and discussed in Section 4. Finally, Section 5 presents the main conclusions of this work.

Noise Detection
Different pre-processing techniques have been proposed in the literature for noise detection and removal. Statistical models were the earliest approaches used in this task, and some of them were applicable only to one-dimensional data sets [17]. In these approaches, noise detection is dealt with by techniques based on data distribution models [3]. The main problem of these approaches is the assumption that the data distribution is known in advance, which is not true for most real world problems.
Clustering techniques [8, 16] are also applied to noise detection tasks. In this approach, small groups of data, dispersed among the existing examples, are regarded as possible noise. A third approach employs ML classification algorithms, which are used to detect and remove noisy examples [34, 19]. The work presented here follows a fourth approach, in which noise detection problems are investigated by distance-based techniques [20, 30, 5, 32]. These techniques are named distance-based because they use the distance between an example and its nearest neighbors.
Distance-based techniques are simple to implement and do not make assumptions about the data distribution. However, they require a large amount of memory space and computational time, with a complexity directly proportional to the data dimensionality and the number of examples [17]. The most popular distance-based technique in the literature is the k-nearest neighbor (k-NN) algorithm, which is the simplest algorithm belonging to the class of instance-based supervised ML techniques [25]. Distance-based techniques use similarity measures to calculate the distance between instances of a data set and use this information to identify possible noisy data. One of the main questions regarding distance-based techniques is which similarity measure to use in the calculation of distances.
For high dimensional data sets, the commonly used Euclidean metric is not adequate [1], since data is commonly sparse. The Heterogeneous Value Difference Metric (HVDM) is shown in [36] to be suitable for high dimensional data and was therefore used in this paper. This metric is based on the distribution of the attributes in a data set with respect to their output values, and not only on point values, as in the Euclidean distance and other similar distance metrics. Equation 1 presents the HVDM metric. VDM_a(x_a, z_a) is the Value Difference Metric (VDM) distance [29], suitable for nominal attributes, and σ_a is the standard deviation of attribute a in the data set. Since the data sets employed in this paper have no nominal attributes, the second row of Equation 2 is not used in this work.
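As an illustration of the continuous-attribute case used here, the sketch below assumes the standard HVDM definition, HVDM(x, z) = sqrt(sum over a of d_a(x_a, z_a)^2) with d_a(x_a, z_a) = |x_a - z_a| / (4 σ_a) for continuous attributes. The function names, the toy values, the use of the population standard deviation and the handling of constant attributes are assumptions made for this example, not details taken from the experiments.

```python
import math
from statistics import pstdev


def attribute_sigmas(data):
    """Standard deviation of each attribute (column) over the data set.

    Assumption: the population standard deviation is used.
    """
    return [pstdev(column) for column in zip(*data)]


def hvdm_continuous(x, z, sigmas):
    """HVDM restricted to continuous attributes.

    Each attribute contributes |x_a - z_a| / (4 * sigma_a); the squared
    contributions are summed and the square root of the sum is returned.
    """
    total = 0.0
    for xa, za, sigma in zip(x, z, sigmas):
        if sigma == 0.0:
            # Assumption: a constant attribute cannot separate instances,
            # so it is skipped instead of dividing by zero.
            continue
        d = abs(xa - za) / (4.0 * sigma)
        total += d * d
    return math.sqrt(total)


# Hypothetical expression matrix: 4 tissues x 3 genes (made-up values).
data = [
    [1.0, 10.0, 0.5],
    [2.0, 12.0, 0.5],
    [3.0, 14.0, 0.5],
    [4.0, 16.0, 0.5],
]
sigmas = attribute_sigmas(data)
print(hvdm_continuous(data[0], data[3], sigmas))
```

Dividing by 4 σ_a puts attributes measured on very different scales on a comparable footing, which matters when thousands of gene expression levels enter the same distance.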
The k-nearest neighbor (k-NN) algorithm was used for finding the neighbors of a given instance. This algorithm classifies an instance according to the class of the majority of its k nearest neighbors. The value of the k parameter, which represents the number of nearest neighbors of the instance, influences the performance of the k-NN algorithm. Typically, it is a small odd integer, such as 1, 3 or 5.
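The majority-vote rule can be sketched in a few lines. This is a minimal illustration under stated assumptions: the Euclidean distance stands in for HVDM to keep the example short, the function names and toy data are hypothetical, and vote ties are broken in favor of the class seen first among the nearest neighbors.

```python
import math
from collections import Counter


def euclidean(x, z):
    """Placeholder distance; the experiments use HVDM instead."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, z)))


def knn_predict(query, X, y, k=3, dist=euclidean):
    """Return the majority class among the k training instances nearest to query."""
    order = sorted(range(len(X)), key=lambda i: dist(query, X[i]))
    votes = Counter(y[i] for i in order[:k])
    # most_common keeps first-seen order among equal counts, so a tie is
    # resolved in favor of the class of the closest tied neighbor.
    return votes.most_common(1)[0][0]


# Hypothetical two-class toy data.
X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
y = ["A", "A", "A", "B", "B", "B"]
print(knn_predict([0.5, 0.5], X, y, k=3))
print(knn_predict([5.5, 5.5], X, y, k=3))
```

An odd k avoids vote ties in two-class problems, which is one reason small odd values are the usual choice.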
The techniques evaluated in this paper are the noise detection filters Edited Nearest Neighbor (ENN), Repeated ENN (RENN) and AllkNN, all based on the k-NN algorithm.
In order to explain the techniques evaluated, let T be the original training set and S be a subset of T, obtained by the application of any of the distance-based techniques evaluated. Now, suppose that T has n instances x_1, ..., x_n. Each instance x of T (and also of S) has k nearest neighbors.
The ENN algorithm was proposed in [37]. Initially, S = T, and an instance is considered noise and removed from the data set if its class is different from the class of the majority of its k nearest neighbors. This procedure removes mislabeled data and borderline instances. In the RENN technique, the ENN algorithm is repeatedly applied to the data set until every remaining instance has the same class as the majority of its neighbors. Finally, the AllkNN algorithm was proposed by Tomek [31] and is also an extension of the ENN algorithm. This algorithm proceeds as follows: for i = 1, ..., k, mark as incorrect (possible noise) any instance incorrectly classified by its i nearest neighbors. After all instances in the data set have been analyzed, the marked instances are removed.
Besides the large number of existing techniques for noise detection, recent studies also use hybrid systems and ensembles of classifiers to improve system performance and reduce the deficiencies of the applied algorithms. Hybridization is used to overcome the deficiencies of one particular classification algorithm by exploiting the advantages of multiple approaches while compensating for their weaknesses [17].
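The three filters can be sketched as follows. This is a minimal illustration and not the implementation used in the experiments: the Euclidean distance stands in for HVDM, vote ties go to the class of the closest tied neighbor, all helper names and the toy data are hypothetical, and each function returns the indices of the instances kept in S.

```python
import math
from collections import Counter


def euclidean(x, z):
    """Placeholder distance; the experiments use HVDM instead."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, z)))


def neighbors(i, X, active, dist):
    """Indices in `active` other than i, sorted by distance to instance i."""
    return sorted((j for j in active if j != i), key=lambda j: dist(X[i], X[j]))


def majority(labels):
    return Counter(labels).most_common(1)[0][0]


def enn(X, y, k=3, active=None, dist=euclidean):
    """One ENN pass: keep instances whose class matches their k-NN majority."""
    active = list(range(len(X))) if active is None else list(active)
    keep = []
    for i in active:
        nn = neighbors(i, X, active, dist)[:k]
        if not nn or majority(y[j] for j in nn) == y[i]:
            keep.append(i)
    return keep


def renn(X, y, k=3, dist=euclidean):
    """Repeat ENN until a pass removes nothing."""
    active = list(range(len(X)))
    while True:
        kept = enn(X, y, k, active, dist)
        if len(kept) == len(active):
            return kept
        active = kept


def all_knn(X, y, k=3, dist=euclidean):
    """Flag an instance if any i-NN vote (i = 1..k) disagrees; remove at the end."""
    active = list(range(len(X)))
    flagged = set()
    for i in active:
        nn = neighbors(i, X, active, dist)
        for m in range(1, k + 1):
            if nn[:m] and majority(y[j] for j in nn[:m]) != y[i]:
                flagged.add(i)
                break
    return [i for i in active if i not in flagged]


# Hypothetical data: two clusters plus one instance (index 4) labeled "B"
# although it lies next to cluster "A".
X = [[0, 0], [1, 0], [2, 0], [3, 0], [1.5, 2],
     [10, 10], [10, 11], [11, 10], [11, 11]]
y = ["A", "A", "A", "A", "B", "B", "B", "B", "B"]
print(enn(X, y), renn(X, y), all_knn(X, y))
```

On this toy set all three filters drop only the suspicious instance. AllkNN is the strictest of the three, since a single disagreeing vote for any i up to k is enough to flag an instance.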

Experiments
The experiments employed the 10-fold cross validation methodology [25]. All selected data sets were presented to the noise detection techniques investigated. Next, their pre-processed versions, resulting from the application of each noise detection technique, were presented to the three ML algorithms employed. The original version of each data set was also presented directly to the ML algorithms, in order to compare the performance obtained by the ML algorithms with the original data sets and with their pre-processed versions. The error rate of each ML algorithm was calculated as the average of the individual errors obtained for each test partition. Each noise detection technique was applied 10 times, once for each training partition of the data set produced by the 10-fold cross validation methodology.
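The protocol above can be sketched as a loop in which the noise filter only ever sees the training partition, while the test partition is left untouched. The sketch makes several assumptions for brevity: a 1-NN classifier stands in for C4.5, RIPPER and SVMs, a compact ENN stands in for the filters, the Euclidean distance replaces HVDM, and the data and all names are hypothetical.

```python
import math
import random
from collections import Counter


def dist(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def knn_label(q, X, y, k):
    """Majority label among the k instances of X nearest to q."""
    order = sorted(range(len(X)), key=lambda i: dist(q, X[i]))[:k]
    return Counter(y[i] for i in order).most_common(1)[0][0]


def enn_filter(X, y, k=3):
    """Stand-in noise filter: drop instances misclassified by their k neighbors."""
    keep = []
    for i in range(len(X)):
        rest_X = X[:i] + X[i + 1:]
        rest_y = y[:i] + y[i + 1:]
        if knn_label(X[i], rest_X, rest_y, k) == y[i]:
            keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]


def cross_validate(X, y, noise_filter=None, folds=10, seed=0):
    """Mean and per-fold test error; the filter is applied to training data only."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    parts = [idx[f::folds] for f in range(folds)]
    errors = []
    for f in range(folds):
        test = parts[f]
        train = [i for g in range(folds) if g != f for i in parts[g]]
        X_tr, y_tr = [X[i] for i in train], [y[i] for i in train]
        if noise_filter is not None:
            X_tr, y_tr = noise_filter(X_tr, y_tr)
        # Stand-in classifier: 1-NN trained on the (possibly filtered) partition.
        wrong = sum(knn_label(X[i], X_tr, y_tr, 1) != y[i] for i in test)
        errors.append(wrong / len(test))
    return sum(errors) / folds, errors


# Hypothetical data: two well separated rows of points, one flipped label.
X = [[0.1 * i, 0.0] for i in range(20)] + [[0.1 * i, 10.0] for i in range(20)]
y = ["A"] * 20 + ["B"] * 20
y[5] = "B"  # artificially mislabeled instance

raw_mean, _ = cross_validate(X, y)
filtered_mean, _ = cross_validate(X, y, noise_filter=enn_filter)
print(raw_mean, filtered_mean)
```

Filtering inside the loop keeps the test partitions independent of the pre-processing, so the averaged error reflects what a classifier trained on cleaned data would do on unseen, possibly noisy, examples.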
The experiments were run on a 3.0 GHz Intel Pentium 4 dual processor PC with 1.0 GB of RAM. For the noise detection techniques evaluated, the code provided by [35] was used. The values of the k parameter, which defines the number of nearest neighbors, were set to 1, 3 and 9, following a geometric progression that includes the number three, which is the default value of that code.
The ML algorithms investigated were C4.5, used for the induction of Decision Trees, RIPPER, which produces a set of rules from a data set, and Support Vector Machines (SVMs), which look for representative examples to improve the generalization of the decision border.
The C4.5 algorithm [27] uses a greedy approach to progressively grow a decision tree whose leaf nodes represent classes. C4.5 deals with noisy data by using a pruning procedure. In this procedure, branches of the trained tree that present, according to some criterion, low expressive power are pruned. This procedure aims to simplify the built tree and to reduce its classification error rate.
The RIPPER algorithm (Repeated Incremental Pruning to Produce Error Reduction) [6] is a rule induction algorithm proposed to obtain low classification error rates even in the presence of noise and high dimensional data. Rule induction algorithms are more flexible than decision tree algorithms, like C4.5, since new rules can be added or modified as new data are included [17].
SVMs are learning algorithms based on statistical learning theory, through the principle of Structural Risk Minimization (SRM) [33]. SVMs perform a non-linear data analysis in a high dimensional space where a maximum margin hyperplane can be built, allowing the separation of positive and negative classes. They present high generalization ability, are robust to high dimensional data and have been successfully applied to several classification problems [28, 9].
In the experiments reported in this paper, we used data sets obtained from gene expression analysis, particularly tissue classification. Gene expression analysis problems are, in general, represented by complex and high dimensional data sets, which are very susceptible to noise. Table 1 shows the format of the gene expression data sets used in the experiments. It shows that each data set can be represented by a table where the first row has the identification of a particular tissue, the expression levels of different genes for this tissue and the label associated with the tissue.
The main features of the gene expression data sets used in the experiments are described in Table 2. This table presents, for each data set, its total number of instances, its number of attributes (data dimensionality) and its classes.
Most of the data sets used in the experiments reported in this paper are related to the problem of cancer tissue classification. The development of efficient data analysis tools to support experts may allow better and earlier diagnosis of cancer, leading to more effective patient treatment and increased survival rates. Several research groups are currently working on gene expression analysis of tumor tissues.
The ExpGen data set [4] contains expression level measurements of 2467 genes obtained from 79 different laboratory experiments for functional gene classification. This application consists of assigning a gene to a class that represents its function in the cellular environment. The data set used here is composed of only 207 of these genes, those that could be categorized into five classes during the laboratory experiments.
The Golub data set [15] has gene expression levels from patients with acute leukemia. The gene expression data were obtained from 72 microarray images and measure the expression levels of 6817 human genes. The disease was categorized into two different types, Acute Lymphoid Leukemia (ALL) and Acute Myeloid Leukemia (AML). The same pre-processing performed in [11] was applied to the Golub data set to simplify its data.
The Leukemia data set is known in the literature as St. Jude Leukemia [38]. It is composed of six different types of pediatric acute lymphoid leukemia and another group with examples that could not be categorized as one of the previous six types. The original data set has 12558 genes, so a pre-processed version found at http://sdmc.lit.org.sg/GEDatasets and described in [38] was used, reducing the number of genes to 271.
The Lung data set has examples related to lung cancer, where, for each patient, the label can be normal tissue or one of three different types of lung cancer. The three types of lung cancer analyzed are adenocarcinomas (ADs), squamous cell carcinomas (SQs) and carcinoids (COID). This data set has 197 instances, with 1000 attributes each, and was presented in [26].
The last data set analyzed, the Colon data set, is described in Alon et al. [2] and includes patients with and without colon cancer. The data set presents gene expression data obtained from 62 microarray images, which measure the expression levels of 6500 human genes. Pre-processing techniques reduced the number of input attributes to 2000.
For SVM training, the SVMTorch II [7] software was employed. C4.5 training was carried out with the software provided by Quinlan [27], and for RIPPER training the Weka simulator from the University of Waikato [13] was adopted. The parameter values for the three algorithms were the default values suggested by the tools employed and were kept the same for all experiments. Scripts in the Perl programming language were also developed to convert the data sets to the different formats required by Wilson's code [35], SVMTorch II, the Weka simulator and C4.5.
To evaluate the results obtained in the experiments, Friedman's statistical test [14] and Dunn's multiple comparisons post-hoc test [12] were employed, according to the methodology described in [10]. Friedman's test was adopted since it is recommended for the comparison of different ML algorithms applied to multiple data sets and has the advantage of not assuming that the measurements follow a Normal distribution.
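As an illustration of how Friedman's test works on a table of error rates, the sketch below ranks the configurations within each data set and computes the usual chi-square statistic, 12N / (k(k+1)) * (sum of squared mean ranks - k(k+1)^2 / 4), for N data sets and k configurations. The error values are made up for the example, the function names are hypothetical, and the tie correction of the statistic is omitted.

```python
def average_ranks(row):
    """Rank the values in a row (1 = lowest error); ties share the average rank."""
    order = sorted(range(len(row)), key=lambda j: row[j])
    ranks = [0.0] * len(row)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and row[order[end + 1]] == row[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # average of the 1-based positions pos+1..end+1
        for j in order[pos:end + 1]:
            ranks[j] = shared
        pos = end + 1
    return ranks


def friedman(errors):
    """errors[i][j] is the error of configuration j on data set i.

    Returns the mean rank of each configuration and the Friedman chi-square
    statistic (without tie correction).
    """
    n, k = len(errors), len(errors[0])
    ranked = [average_ranks(row) for row in errors]
    mean_ranks = [sum(r[j] for r in ranked) / n for j in range(k)]
    chi2 = 12 * n / (k * (k + 1)) * (
        sum(r * r for r in mean_ranks) - k * (k + 1) ** 2 / 4
    )
    return mean_ranks, chi2


# Made-up error rates: 5 data sets (rows) x 3 configurations (columns),
# e.g. original data and two pre-processed versions.
errors = [
    [0.10, 0.08, 0.12],
    [0.20, 0.15, 0.25],
    [0.05, 0.04, 0.06],
    [0.30, 0.28, 0.33],
    [0.12, 0.10, 0.14],
]
mean_ranks, chi2 = friedman(errors)
print(mean_ranks, chi2)
```

Because only the ranks within each data set enter the statistic, the test makes no assumption about the distribution of the error rates themselves. The statistic is compared against a chi-square distribution with k - 1 degrees of freedom to decide whether the null hypothesis of equal mean ranks is rejected.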
The null hypothesis assumes that all analyzed algorithms are equivalent, that is, that their mean ranks are the same. If the null hypothesis is rejected, and the analyzed algorithms are therefore statistically different, a post-hoc test can be applied to detect which of the algorithms differ. Dunn's statistical post-hoc test was applied, since it is recommended for situations where all analyzed algorithms are compared to a control algorithm, which is the strategy employed in the experiments performed in this paper.
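A comparison against a control can be sketched as below. The sketch uses one common formulation, often called the Bonferroni-Dunn test: the difference in mean ranks is divided by sqrt(k(k+1) / (6N)) and compared with a normal critical value corrected for the k - 1 comparisons. Whether this matches the exact variant applied in the experiments is an assumption, and the mean ranks, function name and parameters are made up for the example.

```python
import math
from statistics import NormalDist


def compare_to_control(mean_ranks, n_datasets, control=0, alpha=0.05):
    """Flag configurations whose mean rank differs significantly from the control.

    Uses z = (R_j - R_control) / sqrt(k (k + 1) / (6 N)) and a two-sided
    normal critical value with a Bonferroni correction over k - 1 comparisons.
    """
    k = len(mean_ranks)
    se = math.sqrt(k * (k + 1) / (6.0 * n_datasets))
    z_crit = NormalDist().inv_cdf(1 - alpha / (2 * (k - 1)))
    results = {}
    for j, rank in enumerate(mean_ranks):
        if j == control:
            continue
        z = (rank - mean_ranks[control]) / se
        results[j] = (z, abs(z) > z_crit)
    return z_crit, results


# Made-up mean ranks over 5 data sets; configuration 0 (e.g. the original,
# unfiltered data) is taken as the control.
z_crit, results = compare_to_control([2.0, 1.0, 3.0], n_datasets=5, control=0)
print(z_crit, results)
```

With so few data sets, a difference of one full rank is not enough to reach significance in this example, which illustrates why such tests need results from several data sets to detect differences.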

Experimental Results
In the pre-processing phase, the number of removed instances was different for each data set analyzed. In general, it was between 20% and 30% of the total number of instances, except for the Colon data set, original and simplified versions, which presented reductions between 30% and 40%.
The time spent in the pre-processing phase was measured to show how the application of the noise detection techniques investigated can affect the overall processing time. It is important to mention that the pre-processing phase is applied only once for each data set analyzed, generating a pre-processed data set that can be used several times by different ML algorithms. The time consumed was always less than one minute. Another observation is related to data set complexity: more time was spent in the pre-processing of more complex data sets.
In order to measure the effectiveness of the noise detection techniques employed, the performance of the three ML algorithms was evaluated with the original and the pre-processed data in terms of accuracy, complexity and the processing time necessary to build the induced hypothesis. For all experiments, the statistical tests were applied at a 95% confidence level.
For SVMs, in general, the error rates of the classifiers generated after the application of noise detection techniques, for all evaluated k values, were the same as those obtained for the original data sets. The same was true for the Colon data set, but only for some values of k. The pre-processed Leukemia and ExpGen data sets had only some similar results, but none better than those obtained for the original data sets, while the Golub data set presented worse results in all cases. The results can be seen in Table 3, where the best results are highlighted in bold and error rates similar to the best ones for each data set are shown in italics. Standard deviations are reported in parentheses.
The analysis of the C4.5 classification error rates, which can be seen in Table 4, shows that the pre-processed Leukemia, Lung and Golub data sets presented results similar to or better than those obtained for the original data sets. The ExpGen data set presented only a few error rates similar to those obtained for the original data set. The pre-processed Colon data set provided only worse results.
According to Table 5, the RIPPER algorithm presented similar error rates for the original and pre-processed data on the Leukemia, ExpGen and Colon data sets. For the last two data sets, some results were improved by the pre-processing. The remaining pre-processed data sets, Lung and Golub, presented larger improvements in accuracy after the pre-processing phase. For these two data sets, error rates were lower after pre-processing in the majority of the experiments carried out.
In the complexity analysis of the SVMs, the number of Support Vectors (SVs), data that determine the decision border induced by SVMs, was considered. A smaller number of SVs indicates less complexity of the induced model.
For the C4.5 algorithm, complexity was determined by the mean size of the induced decision trees. Smaller decision trees are easier to analyze and therefore improve the comprehensibility of the model.
The complexity of the RIPPER algorithm was measured by the number of rules produced during the training phase. The smaller the number of rules produced, the simpler the generated model.
For all three ML algorithms investigated, the complexity was reduced when the pre-processed data sets were used, as presented in Tables 6, 7 and 8 for the SVM, C4.5 and RIPPER algorithms, respectively. In these tables, the best results are highlighted in bold and complexities similar to the best ones, for each data set, are shown in italics. Standard deviations are reported in parentheses.
According to Tables 6, 7 and 8, most of the complexities were reduced after pre-processing, except for the Golub data set and the RIPPER algorithm, in which not all complexities were reduced.
For the SVMs, the smaller the pre-processed data set produced by the noise detection techniques, the lower the number of SVs obtained and, consequently, the lower the complexity of the model. For the C4.5 algorithm, the model complexity decreased down to a lower bound, beyond which further reduction of the pre-processed data set did not reduce the complexity.
For the RIPPER algorithm, the final models were also simplified, but with less reduction in the complexity. The complexity obtained using the original data for the Golub data set was maintained for its pre-processed versions.
The time taken by the SVM, C4.5 and RIPPER algorithms to induce hypotheses using the pre-processed data sets was always lower than the time obtained with the original data sets, taking at most 1 second. For SVMs, the processing time was only slightly reduced in comparison to the time obtained for the original data sets.
The analysis of the results presented in this paper shows that the three noise detection techniques evaluated behaved similarly in terms of amount of noise removed (data set reduction), time taken and effect on the performance of the ML algorithms. A possible explanation is that they are all noise filtering techniques based on the k-NN algorithm. Moreover, they are closely related: AllkNN is an extension of ENN, while RENN is the ENN algorithm applied multiple times. For the gene expression data sets analyzed in this paper, the differences among these algorithms may not result in significant differences in the performance of the ML algorithms. Most of the experiments presented satisfactory results, with lower error rates and better performance when compared to those obtained with the original data sets, which indicates that the noise detection techniques improved the performance of the ML algorithms evaluated. The C4.5 and RIPPER algorithms benefited from the application of noise detection techniques for most of the data sets investigated, and the complexity of their induced models was reduced. For the SVMs, the new results were slightly better, with lower complexity.
Furthermore, the gain in comprehensibility and the reduction in the time spent during the training process are additional advantages, since the complexities for all data sets were reduced after pre-processing (the noise detection and removal phase).
Therefore, the application of noise detection techniques in a pre-processing phase has the advantage of reducing the complexity of the classifiers induced by ML algorithms, as well as the time spent in training them, while producing, in most experiments, classification error rates better than or similar to those obtained for the original data sets. This indicates that the distance-based noise detection techniques kept the most expressive patterns of the data sets and allowed the ML algorithms to induce simpler classifiers, as shown by the reduced complexity and lower classification error rates obtained.

Conclusions
This paper investigated the application of distance-based noise detection techniques in different gene expression classification problems. We did not find in the literature a single approach or algorithm, tested on several data sets, that was able to detect noise without reducing classification accuracy. We were also not able to find noise detection experiments using gene expression data sets that detect tissues that are probably noise. The closest works we found in gene expression analysis were [18, 24]. However, these works detect and eliminate only genes, not tissues. The data sets employed here are related to both gene classification and tissue classification.
In the experiments performed here, three ML algorithms were trained over the original and pre-processed data sets. They were employed to evaluate the power of these techniques in maintaining the most informative patterns. The results observed indicate that the noise detection techniques employed were effective in the noise detection process. These experiments showed that the incorporation of noise detection and elimination resulted in simpler ML classifiers and in a reduction of their classification error rates, especially for the C4.5 and RIPPER algorithms. Another advantage for these two algorithms was an increase in comprehensibility.
We are now investigating new distance-based techniques for noise detection and developing ensembles of noise detection techniques, aiming to further improve the gains obtained by the identification and removal of noisy data. Preliminary results, presented in Libralon [23], suggest that ensembles of distance-based techniques can be a good alternative for noise detection in gene expression data sets.