References: Books

C. Batini and M. Scannapieco, Data Quality: Concepts, Methodologies and Techniques, Data-Centric Systems and Applications, 2006.

V. Barnett and T. Lewis, Outliers in Statistical Data, 1994.

T. Dasu and T. Johnson, Exploratory Data Mining and Data Cleaning, 2003.

D. Hawkins, Identification of Outliers, 1980.

T. N. Herzog, F. J. Scheuren, and W. E. Winkler, Data Quality and Record Linkage Techniques, 2007.

R. Kimball and J. Caserta, The Data Warehouse ETL Toolkit, 2004.

F. Naumann, Quality-Driven Query Answering for Integrated Information Systems, Lecture Notes in Computer Science, vol.2261, 2002.

J. W. Tukey, Exploratory Data Analysis, 1977.

R. Y. Wang, M. Ziad, and Y. W. Lee, Data Quality, Advances in Database Systems, vol.23, 2002.

V. Chandola, A. Banerjee, and V. Kumar, Anomaly Detection: A Survey, ACM Computing Surveys, 2009.

A. K. Elmagarmid, P. G. Ipeirotis, and V. S. Verykios, Duplicate Record Detection: A Survey, IEEE Transactions on Knowledge and Data Engineering (TKDE), vol.19, issue.1, pp.1-16, 2007.

J. M. Hellerstein, Quantitative Data Cleaning for Large Databases, White paper, United Nations Economic Commission for Europe.

G. Navarro, A Guided Tour to Approximate String Matching, ACM Comput. Surv., vol.33, issue.1, pp.31-88, 2001.

W. E. Winkler, Overview of Record Linkage and Current Research Directions, Tech. Rep. of U.S. Census Bureau.

A Survey of Data Quality Issues in Cooperative Systems, 2004.

N. Koudas, S. Sarawagi, and D. Srivastava, Record Linkage: Similarity Measures and Algorithms, 2006.

A. Banerjee, V. Chandola, V. Kumar, J. Srivastava et al., Anomaly Detection: A Tutorial, Tutorial SIAM Conf. on Data Mining, 2008.

H.-P. Kriegel, P. Kröger, and A. Zimek, Outlier Detection Techniques, Tutorial, PAKDD.

Telcordia's Database Reconciliation and Data Quality Analysis Tool, Proc. VLDB, pp.615-618, 2000.

T. Dasu, T. Johnson, S. Muthukrishnan et al., Mining Database Structure; Or, How to Build a Data Quality Browser, Proc. SIGMOD, 2002.

J. Hipp, U. Güntzer, and U. Grimmer, Data Quality Mining: Making a Virtue of Necessity, Proc. Workshop DMKD, 2001.

D. Luebbers, U. Grimmer, and M. Jarke, Systematic Development of Data Mining-Based Data Quality Tools, Proc. VLDB 2003, pp.548-559, 2003.

R. B. Kline, Data Preparation and Screening, Chapter 3, Principles and Practice of Structural Equation Modeling, pp.45-62, 2005.

R. K. Pearson, Surveying Data for Patchy Structure, SDM.

STATNOTES: Topics in Multivariate Analysis.

Bilke et al., Automatic Data Fusion with HumMer, Proc. VLDB.

A Primitive Operator for Similarity Joins in Data Cleaning, Proc. ICDE.

P. Christen, Febrl: an open source data cleaning, deduplication and record linkage system with a graphical user interface, pp.1065-1068, 2008.

P. Christen, T. Churches, and X. Zhu, Probabilistic name and address cleaning and standardization, Proc. Australasian Data Mining Workshop.

H. Galhardas, D. Florescu, D. Shasha, E. Simon, and C.-A. Saita, Declarative Data Cleaning: Language, Model, and Algorithms, Proc. VLDB Conf, pp.371-380, 2001.

M. A. Hernández and S. J. Stolfo, Real-World Data is Dirty: Data Cleansing and the Merge/Purge Problem, Data Mining and Knowledge Discovery, vol.2, issue.1, pp.9-37, 1998.

E. Rahm and H. H. Do, Data Cleaning: Problems and Current Approaches, Data Engineering Bulletin, vol.23, issue.4, pp.3-13, 2000.

V. Raman and J. M. Hellerstein, Potter's Wheel: An Interactive Data Cleaning System, Proc. VLDB, pp.381-390, 2001.

P. Vassiliadis, Z. Vagena, S. Skiadopoulos, N. Karayannidis, and T. Sellis, ARKTOS: A Tool For Data Cleaning and Transformation in Data Warehouse Environments, Bulletin of the Technical Committee on Data Engineering, vol.23, issue.4, pp.42-47, 2000.

P. Vassiliadis, A. Karagiannis, V. Tziovara, and A. Simitsis, Towards a Benchmark for ETL Workflows, Proc. QDB, pp.49-60, 2007.

M. Weis and I. Manolescu, XClean in Action (Demo), pp.259-262, 2007.

References: Record Linkage and Duplicate Detection

R. Ananthakrishna, S. Chaudhuri, and V. Ganti, Eliminating Fuzzy Duplicates in Data Warehouses, Proc. of VLDB, pp.586-597, 2002.

N. Bansal, A. Blum, and S. Chawla, Correlation clustering, Machine Learning, pp.89-113, 2004.

R. Baxter, P. Christen, and T. Churches, A Comparison of Fast Blocking Methods for Record Linkage, Proc. of the KDD'03 Workshop on Data Cleaning, Record Linkage and Object Consolidation, pp.27-29, 2003.

I. Bhattacharya and L. Getoor, Iterative record linkage for cleaning and integration, Proceedings of the 9th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, DMKD '04, pp.11-18, 2004.
DOI : 10.1145/1008694.1008697

I. Bhattacharya and L. Getoor, Collective entity resolution in relational data, ACM Transactions on Knowledge Discovery from Data, vol.1, issue.1, 2007.
DOI : 10.1145/1217299.1217304

M. Bilenko and R. J. Mooney, Adaptive duplicate detection using learnable string similarity measures, Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining , KDD '03, pp.39-48, 2003.
DOI : 10.1145/956750.956759

M. Bilenko, S. Basu, and M. Sahami, Adaptive Product Normalization: Using Online Learning for Record Linkage in Comparison Shopping, Fifth IEEE International Conference on Data Mining (ICDM'05), pp.58-65, 2005.
DOI : 10.1109/ICDM.2005.18

P. Christen, Automatic Record Linkage using Seeded Nearest Neighbour and Support Vector Machine Classification, ACM SIGKDD Conf, 2008.

M. G. Elfeky, A. K. Elmagarmid, and V. S. Verykios, TAILOR: A Record Linkage Tool Box, Proc. of the 18th International Conf. on Data Engineering, pp.17-28, 2002.

A. K. Elmagarmid, P. G. Ipeirotis, and V. S. Verykios, Duplicate Record Detection: A Survey, IEEE Trans. Know. Data Eng, vol.19, issue.1, pp.1-16, 2007.

I. P. Fellegi and A. B. Sunter, A Theory for Record Linkage, Journal of the American Statistical Association, vol.64, pp.1183-1210, 1969.

L. Gravano, P. G. Ipeirotis, H. V. Jagadish, N. Koudas et al., Using q-grams in a DBMS for Approximate String Processing, IEEE Data Eng. Bull, vol.24, issue.4, pp.28-34, 2001.

L. Gravano, P. G. Ipeirotis, N. Koudas, and D. Srivastava, Text joins for data cleansing and integration in an RDBMS, Proceedings 19th International Conference on Data Engineering (Cat. No.03CH37405), pp.729-731, 2003.
DOI : 10.1109/ICDE.2003.1260850

M. A. Hernández and S. J. Stolfo, The Merge/Purge Problem for Large Databases, Proc. SIGMOD Conf, pp.127-135, 1995.

W. L. Low, M.-L. Lee, and T. W. Ling, A Knowledge-Based Approach for Duplicate Elimination in Data Cleaning, Inf. Syst, vol.26, issue.8, pp.585-606, 2001.

H. Kang, L. Getoor, B. Shneiderman et al., Interactive Entity Resolution in Relational Data: A Visual Analytic Tool and Its Evaluation, IEEE Transactions on Visualization and Computer Graphics, vol.14, issue.5, pp.999-1014, 2008.
DOI : 10.1109/TVCG.2008.55

A. McCallum, K. Nigam, and L. H. Ungar, Efficient clustering of high-dimensional data sets with application to reference matching, Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '00, pp.169-178, 2000.
DOI : 10.1145/347090.347123

A. E. Monge, Matching Algorithms within a Duplicate Detection System, IEEE Data Eng. Bull, vol.23, issue.4, pp.14-20, 2000.

S. Tejada, C. A. Knoblock, and S. Minton, Learning domain-independent string transformation weights for high accuracy object identification, Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '02, pp.350-359, 2002.
DOI : 10.1145/775047.775099

M. Weis, F. Naumann, and F. Brosy, A Duplicate Detection Benchmark for XML (and Relational) Data, Proc. ACM SIGMOD 2006 Workshop on Information Quality in Information Systems, 2006.

W. E. Winkler, Methods for Evaluating and Creating Data Quality, Inf. Syst, vol.29, issue.7, pp.531-550, 2004.

W. E. Winkler and Y. Thibaudeau, An Application of the Fellegi-Sunter Model of Record Linkage to the 1990 U.S. Decennial Census, U.S. Bureau of the Census, 1991.

References: Inconsistencies

P. Bohannon, W. Fan, F. Geerts, X. Jia, and A. Kementsietsidis, Conditional Functional Dependencies for Data Cleaning, Proc. ICDE, pp.746-755, 2007.

L. Bravo, W. Fan, and S. Ma, Extending Dependencies with Conditions, Proc. VLDB, pp.243-254, 2007.

S. Ceri, F. Di Giunta, and P. L. Lanzi, Mining constraint violations, ACM Trans. Database Syst, vol.32, issue.1, p.6, 2007.

A. Chandel, N. Koudas, K. Q. Pu, and D. Srivastava, Fast Identification of Relational Constraint Violations, Proc. ICDE, 2007.

W. Fan, F. Geerts, X. Jia, and A. Kementsietsidis, Conditional functional dependencies for capturing data inconsistencies, TODS, vol.33, issue.2, 2008.

W. Fan, F. Geerts, and X. Jia, Semandaq: A Data Quality System Based on Conditional Functional Dependencies, 2008.

W. Fan, F. Geerts, L. V. S. Lakshmanan, and M. Xiong, Discovering Conditional Functional Dependencies, Proc. ICDE, pp.1231-1234, 2009.

L. Golab, H. J. Karloff, F. Korn, D. Srivastava, and B. Yu, On generating near-optimal tableaux for conditional functional dependencies, PVLDB, vol.1, issue.1, pp.376-390, 2008.

F. Korn, S. Muthukrishnan, and Y. Zhu, Checks and Balances: Monitoring Data Quality Problems in Network Traffic Databases, Proc. VLDB 2003, pp.536-547, 2003.

References: Change Detection

C. C. Aggarwal, A framework for diagnosing changes in evolving data streams, Proc. ACM SIGMOD, 2003.

T. Dasu, S. Krishnan, D. Lin, S. Venkatasubramanian, and K. Yi, Change (Detection) You Can Believe in: Finding Distributional Shifts in Data Streams, Proc. IDA'09, 2009.

T. Dasu, S. Krishnan, S. Venkatasubramanian, and K. Yi, An information-theoretic approach to detecting changes in multi-dimensional data streams, Proc. Interface'06, 2006.

Song et al., Statistical change detection for multidimensional data, Proc. ACM SIGKDD'07, pp.667-676.

References: Outlier Detection

D. Agarwal, Detecting anomalies in cross-classified streams: a Bayesian approach, Know. Inf. Syst, vol.11, issue.1, pp.29-44, 2006.

F. Angiulli and C. Pizzuti, Fast Outlier Detection in High Dimensional Spaces, Proc. Conf. on Principles of Data Mining and Knowledge Discovery, pp.15-26, 2002.

S. D. Bay and M. Schwabacher, Mining distance-based outliers in near linear time with randomization and a simple pruning rule, Proc. KDD, 2003.

M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander, LOF: Identifying Density-Based Local Outliers, Proc. of the 2000 ACM SIGMOD International Conf. on Management of Data, pp.93-104, 2000.

Y. Chen, X. Dang, H. Peng, and H. L. Bart, Outlier detection with the kernelized spatial depth function, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008.

G. Cormode and M. Hadjieleftheriou, Finding frequent items in data streams, Proc. VLDB, 2008.

E. Eskin, Anomaly detection over noisy data using learned probability distributions, Proc. ICML, pp.255-262, 2000.

P. Filzmoser, R. Maronna, and M. Werner, Outlier detection in high dimensions, Computational Statistics and Data Analysis, vol.52, pp.1694-1711, 2008.

P. Galeano, D. Peña, and R. S. Tsay, Outlier detection in multivariate time series by projection pursuit, Journal of the American Statistical Association, vol.101, issue.474, pp.654-669, 2006.

F. Han, Y. Wang, and H. Wang, Odabk: An effective approach to detecting outlier in data stream, Proc. Intl. Conf. on Mach. Learn. and Cybernetics, pp.1036-10419.

Z. He, X. Xu et al., Discovering cluster-based local outliers, 2003.

M. Hubert and S. Van der Veeken, Outlier detection for skewed data, Journal of Chemometrics, vol.22, pp.235-246, 2007.

S. Jiang, Q. Li, K. Li, H. Wang, Meng et al., GLOF: a new approach for mining local outlier.

Mining Top-n Local Outliers in Large Databases, Proc. KDD, pp.157-162, 2001.

D. Kifer, S. Ben-David, and J. Gehrke, Detecting changes in data streams, Proc. VLDB 2004, pp.180-191, 2004.

E. M. Knorr and R. T. Ng, Algorithms for Mining Distance-Based Outliers in Large Datasets, Proc. VLDB, pp.392-403, 1998.

R. Liu, K. Singh, and J. Teng, DDMA-charts: Nonparametric multivariate moving average control charts based on data depth, Advances in Statistical Analysis, pp.235-258, 2004.

H.-P. Kriegel, M. Schubert, and A. Zimek, Angle-Based Outlier Detection, Proc. ACM SIGKDD, 2008.

R. Maronna and R. Zamar, Robust estimates of location and dispersion for high-dimensional data sets, Technometrics, vol.44, issue.4, pp.307-317, 2002.

S. Papadimitriou, H. Kitagawa, P. B. Gibbons, and C. Faloutsos, LOCI: Fast outlier detection using the local correlation integral, Tech. Rep. Intel Research Lab, 2002.

D. Peña and F. Prieto, Multivariate outlier detection and robust covariance matrix estimation, Technometrics, vol.43, issue.3, pp.286-310, 2001.

S. Ramaswamy, R. Rastogi, and K. Shim, Efficient algorithms for mining outliers from large data sets, Proc. ACM SIGMOD, pp.427-438, 2000.

P. J. Rousseeuw and K. Van Driessen, A fast algorithm for the minimum covariance determinant estimator, Technometrics, vol.41, issue.3, pp.212-223, 1999.

P. J. Rousseeuw and B. C. van Zomeren, Unmasking Multivariate Outliers and Leverage Points, Journal of the American Statistical Association, vol.85, pp.633-639, 1990.

J. Shawe-Taylor and N. Cristianini, Kernel Methods for Pattern Analysis, 2005.

M.-L. Shyu, S.-C. Chen, K. Sarinnapakorn, and L. Chang, A novel anomaly detection scheme based on principal component classifier, Proc. ICDM 2003, pp.353-365, 2003.

L. Su, W. Han, S. Yang, P. Zou et al., Continuous adaptive outlier detection on distributed data streams, HPCC, LNCS 4782, pp.74-85, 2007.

S. Subramaniam, T. Palpanas, D. Papadopoulos, V. Kalogeraki, and D. Gunopulos, Online outlier detection in sensor data using non-parametric models, Proc. VLDB, pp.187-198, 2006.

J. Tang, Z. Chen, A. W. Fu, and D. W. Cheung, Enhancing Effectiveness of Outlier Detections for Low Density Patterns, Proc. PAKDD 2002, LNAI 2336, 2002.

J. Zhang, Q. Gao, and H. Wang, SPOT: A system for detecting projected outliers from high-dimensional data streams, Proc. ICDE, pp.1628-1631, 2008.

References: Missing Values

E. Acuña and C. Rodriguez, The treatment of missing values and its effect in the classifier accuracy, Classification, Clustering and Data Mining Applications, pp.639-648, 2004.

G. Batista and M. C. Monard, An analysis of four missing data treatment methods for supervised learning, Applied Artificial Intelligence, vol.17, pp.519-533, 2003.

A. P. Dempster, N. M. Laird, and D. B. Rubin, Maximum Likelihood from Incomplete Data via the EM Algorithm, Journal of the Royal Statistical Society, vol.39, pp.1-38, 1977.

W. Fan and F. Geerts, Relative Information Completeness.

A. Farhangfar, L. Kurgan, and J. Dy, Impact of imputation of missing values on classification error for discrete data, Pattern Recognition, vol.41, issue.09, pp.3692-3705, 2008.

H. A. Feng, G. C. Chen, C. D. Yin, B. B. Yang, Chen et al., A SVM regression based approach to filling in missing values, Knowledge-Based Intelligent Information and Engineering Systems (KES05), LNCS 3683, pp.581-587, 2005.

M. Hua and J. Pei, Cleaning Disguised Missing Data: A Heuristic Approach, Proc. KDD, 2007.

D. Li, J. Deogun, and W. Spaulding, Towards Missing Data Imputation: A Study of Fuzzy K-means Clustering Method, Rough Sets and Current Trends in Computing, LNCS, vol.3066, 2004.

R. J. A. Little and D. B. Rubin, Statistical Analysis with Missing Data, 1987.

P. E. McKnight, K. M. McKnight, S. Sidani, and A. J. Figueredo, Missing Data: A Gentle Introduction, 2007.

R. K. Pearson, The problem of disguised missing data, SIGKDD Explorations, vol.8, issue.1, pp.83-92, 2006.

J. L. Schafer, Analysis of Incomplete Multivariate Data, 1997.

H. Timm, C. Döring, and R. Kruse, Different approaches to fuzzy clustering of incomplete datasets, International Journal of Approximate Reasoning, vol.35, 2003.

C. Wu, C. Wun, and H. Chou, Using association rules for completing missing data, Proc. Hybrid Intelligent Systems, pp.236-241.

P. D. Allison, Missing Data, Series: Quantitative Applications in the Social Sciences, Thousand Oaks, 2002.

Y. C. Yuan, Multiple imputation for missing data: concepts and new development, Proceedings of the Twenty-fifth Annual SAS Users Group International Conference. SAS Institute, 2000.

P. D. Allison, Multiple Imputation for Missing Data, Sociological Methods & Research, vol.28, issue.3, pp.301-309, 2000.