Journal Title : International Journal of Modern Trends in Engineering and Science
Paper Title : SURVEY ON DETECTING DUPLICATES IN LARGE DATA SET
Author’s Name : Dr. R Priya | Jiji R
Volume 03 Issue 12 2016
ISSN no: 2348-3121
Page no: 37-40 ijmtes0312p09
Abstract – Maintaining the quality of a large data set is a tedious task. Duplicate records degrade the quality of processing and reduce the accuracy of analysis. Progressive duplicate detection algorithms aim to identify unwanted duplicate records in a large data set within a short time. They do not alter the records themselves, so data quality is preserved, and they can be used as part of the data cleaning process on very large data sets. A duplicate detection system can alert the user to potential duplicates when the user tries to create new records or update existing ones. To maintain data quality, a duplicate detection job can be scheduled to check all records that match specific criteria. The data can then be cleaned by deleting, deactivating, or merging the reported duplicates. This survey paper reviews the methods and strategies used for duplicate detection in both small and large data sets. Methods such as the progressive sorted neighborhood method and progressive blocking detect duplicates with shorter execution time and without degrading the quality of the data set.
Keywords – Data mining, Data cleaning, Duplicate detection, Progressive method.
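The progressive sorted neighborhood idea named in the abstract can be illustrated with a minimal Python sketch. It is not the surveyed authors' implementation: the function names, the sorting key, the `difflib`-based similarity measure, the threshold of 0.85, and the window size are all assumptions chosen for illustration. The sketch sorts the records once, then compares records at rank distance 1, then 2, and so on, so that the most likely duplicates are reported first.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1] (an assumed measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def progressive_sorted_neighborhood(records, key=str.lower,
                                    max_window=3, threshold=0.85):
    """Yield (i, j) index pairs of likely duplicates, nearest neighbours first.

    records    : list of strings (hypothetical record representation)
    key        : sorting key that places similar records close together
    max_window : records up to max_window - 1 ranks apart are compared
    threshold  : minimum similarity for a pair to be reported
    """
    # Sort record indices once by the chosen key.
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    # Grow the rank distance step by step: distance 1 first, then 2, ...
    for distance in range(1, max_window):
        for pos in range(len(order) - distance):
            i, j = order[pos], order[pos + distance]
            if similarity(records[i], records[j]) >= threshold:
                yield (min(i, j), max(i, j))
```

Because the function is a generator, a caller can stop consuming results whenever its time budget runs out and still keep the pairs found so far, which is the practical benefit of a progressive method. Pairs yielded early come from records that are adjacent in sort order.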