Journal Title : International Journal of Modern Trends in Engineering and Science
Paper Title : RECORD DEDUPLICATION USING ARTIFICIAL BEE COLONY ALGORITHM
Volume 04 Issue 03 2017
ISSN no: 2348-3121
Page no: 42-46
Abstract – The objective of this paper is to eliminate duplicate entries in data repositories. When multiple databases are integrated, records are replicated, which consumes more storage and reduces throughput. We use the Artificial Bee Colony (ABC) algorithm, a relatively new and widely used method for searching for optimal solutions, in which the solution to a problem emerges from the intelligent behavior of honeybee swarms. Duplicate entries are identified and eliminated based on the similarity measure generated by the ABC algorithm. Data set generators and the Cora and Restaurant data sets are used.
Keywords – ABC Algorithm, Database Integration, Data Set
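The abstract does not spell out how the ABC search produces a similarity measure, so the following is only a minimal sketch under stated assumptions, not the paper's method. It assumes that a candidate solution (a "food source") is a vector of per-field weights plus a decision threshold, and that fitness is the classification accuracy on a small set of labelled record pairs. The names `field_sim`, `record_sim`, `fitness`, `abc_optimize` and the toy `pairs` are hypothetical, and Python's `difflib.SequenceMatcher` stands in for whatever field-level similarity function is actually used.

```python
import random
from difflib import SequenceMatcher

def field_sim(a, b):
    # Stand-in string similarity in [0, 1]; an assumption, not from the paper.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def record_sim(r1, r2, weights):
    # Weighted average of per-field similarities.
    total = sum(weights)
    if total == 0:
        return 0.0
    return sum(w * field_sim(x, y) for w, x, y in zip(weights, r1, r2)) / total

def fitness(solution, pairs):
    # solution = field weights followed by a threshold; fitness = accuracy.
    *weights, threshold = solution
    correct = sum((record_sim(a, b, weights) >= threshold) == is_dup
                  for a, b, is_dup in pairs)
    return correct / len(pairs)

def abc_optimize(pairs, dim, n_sources=10, limit=5, cycles=30, seed=0):
    rng = random.Random(seed)

    def new_source():
        return [rng.random() for _ in range(dim)]

    sources = [new_source() for _ in range(n_sources)]
    fits = [fitness(s, pairs) for s in sources]
    trials = [0] * n_sources
    best_fit = max(fits)
    best = list(sources[fits.index(best_fit)])

    def try_neighbour(i):
        # Perturb one dimension of source i relative to a random other source,
        # then keep the candidate only if it is strictly better (greedy choice).
        nonlocal best, best_fit
        k = rng.choice([j for j in range(n_sources) if j != i])
        d = rng.randrange(dim)
        cand = list(sources[i])
        cand[d] += rng.uniform(-1, 1) * (cand[d] - sources[k][d])
        cand[d] = min(1.0, max(0.0, cand[d]))
        f = fitness(cand, pairs)
        if f > fits[i]:
            sources[i], fits[i], trials[i] = cand, f, 0
        else:
            trials[i] += 1
        if fits[i] > best_fit:
            best, best_fit = list(sources[i]), fits[i]

    for _ in range(cycles):
        for i in range(n_sources):            # employed bee phase
            try_neighbour(i)
        for _ in range(n_sources):            # onlooker phase: fitness-proportional pick
            probs = [f + 1e-9 for f in fits]
            try_neighbour(rng.choices(range(n_sources), weights=probs)[0])
        for i in range(n_sources):            # scout phase: abandon stale sources
            if trials[i] > limit:
                sources[i] = new_source()
                fits[i] = fitness(sources[i], pairs)
                trials[i] = 0
                if fits[i] > best_fit:
                    best, best_fit = list(sources[i]), fits[i]
    return best, best_fit

# Toy labelled pairs (name, address); illustrative data only.
pairs = [
    (("John Smith", "12 Oak St"), ("Jon Smith", "12 Oak Street"), True),
    (("Mary Jones", "4 Elm Rd"), ("Mary Jones", "4 Elm Road"), True),
    (("John Smith", "12 Oak St"), ("Mary Jones", "4 Elm Rd"), False),
    (("Alan Brown", "9 Pine Ave"), ("Alice Green", "77 Lake Dr"), False),
]

if __name__ == "__main__":
    best, best_fit = abc_optimize(pairs, dim=3)  # 2 field weights + 1 threshold
    print("weights:", best[:2], "threshold:", best[2], "accuracy:", best_fit)
```

In this sketch a record pair would be flagged as a duplicate when `record_sim` with the learned weights reaches the learned threshold. The colony size, abandonment `limit` and cycle count are arbitrary illustrative values.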