Pre-processing for noise detection in gene expression classification data

Libralon, Giampaolo Luiz; de Carvalho, André Carlos Ponce de Leon Ferreira; Lorena, Ana Carolina

doi:10.1007/BF03192573

Open access
Published: March 2009

Journal of the Brazilian Computer Society, volume 15, pages 3–11 (2009)

Abstract

Because biological experiments are inherently imprecise, biological data often contain redundant and noisy instances. Noise may arise from errors during data collection, such as contamination of laboratory samples. Gene expression data are a case in point: the equipment and tools currently in use frequently produce noisy measurements. Machine Learning algorithms have been used successfully in gene expression data analysis. Although many Machine Learning algorithms can tolerate noise, detecting and removing noisy instances from the training data set can help the induction of the target hypothesis. This paper evaluates distance-based pre-processing techniques for noise detection in gene expression classification problems. The effectiveness of each technique in removing noisy data is measured by the accuracy that different Machine Learning classifiers obtain on the pre-processed data.
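The abstract does not name the specific filters evaluated, so the following is only an illustrative sketch of what a distance-based noise filter can look like. It uses Wilson's edited nearest neighbor rule as an assumed example: an instance is discarded when its label disagrees with the majority label of its k nearest neighbors. The function names, the choice of Euclidean distance, the value of k, and the toy data are all hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def edited_nearest_neighbor(X, y, k=3):
    """Return the indices of instances kept by an edited-nearest-neighbor filter.

    An instance is kept only if its label matches the majority label
    among its k nearest neighbors (excluding itself).
    """
    kept = []
    for i, (xi, yi) in enumerate(zip(X, y)):
        # Rank every other instance by its distance to instance i.
        neighbors = sorted(
            (j for j in range(len(X)) if j != i),
            key=lambda j: euclidean(xi, X[j]),
        )[:k]
        votes = Counter(y[j] for j in neighbors)
        majority_label, _ = votes.most_common(1)[0]
        if majority_label == yi:
            kept.append(i)
    return kept


# Hypothetical toy data: two well-separated groups of "expression profiles",
# with instance 3 carrying a label that contradicts its neighborhood.
X = [[0.0, 0.1], [0.1, 0.0], [0.2, 0.2], [0.1, 0.2],
     [5.0, 5.1], [5.1, 5.0], [5.2, 5.2], [4.9, 5.0]]
y = ["normal", "normal", "normal", "tumor",
     "tumor", "tumor", "tumor", "tumor"]

kept = edited_nearest_neighbor(X, y, k=3)
print(kept)  # instance 3 is filtered out as suspected label noise
```

In an evaluation like the one the abstract describes, a classifier would then be trained on the kept instances only, and its accuracy compared with that of the same classifier trained on the unfiltered data.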

Author information

Authors and Affiliations

Institute of Mathematics and Computer Sciences - ICMC, University of São Paulo - USP, PO Box 668, 13560-970, São Carlos, SP, Brazil
Giampaolo Luiz Libralon & André Carlos Ponce de Leon Ferreira de Carvalho
Mathematics, Computing and Cognition Center - CMCC, Federal University of ABC - UFABC, 09210-170, Santo André, SP, Brazil
Ana Carolina Lorena

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License ( https://creativecommons.org/licenses/by/2.0 ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Libralon, G.L., de Carvalho, A.C.P.d.L.F. & Lorena, A.C. Pre-processing for noise detection in gene expression classification data. J Braz Comp Soc 15, 3–11 (2009). https://doi.org/10.1007/BF03192573

Received: 27 August 2008
Accepted: 01 March 2009
Issue Date: March 2009
DOI: https://doi.org/10.1007/BF03192573
