IJMTES – RECORD DEDUPLICATION USING ARTIFICIAL BEE COLONY ALGORITHM

Journal Title : International Journal of Modern Trends in Engineering and Science

Paper Title : RECORD DEDUPLICATION USING ARTIFICIAL BEE COLONY ALGORITHM

Author’s Name : N Anand

Volume 04 Issue 03 2017

ISSN No: 2348-3121

Page no: 42-46

Abstract – The objective of this paper is to eliminate duplicated entries in data repositories. When multiple databases are integrated, records become replicated, which occupies additional storage and reduces throughput. We apply the Artificial Bee Colony (ABC) algorithm, a comparatively recent technique that is widely used to search for optimum solutions, in which the solution to a problem emerges from the intelligent foraging behaviour of honeybee swarms. Based on the similarity measure generated by the ABC algorithm, the duplicated entries are identified and eliminated. Synthetic data sets produced by a data set generator, together with the Cora and Restaurant data sets, are used for evaluation.
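To make the approach concrete, the sketch below shows one plausible way an ABC search could drive record deduplication: the bees explore weight vectors for per-field string similarities, and a pair of records is treated as a duplicate when its weighted similarity crosses a threshold. This is a minimal illustration, not the paper's implementation; the record schema (FIELDS), the accuracy-based fitness function, the 0.5 threshold and all colony parameters are assumptions made for this example.

# Minimal sketch (assumptions, not the paper's code): an ABC search over
# weights of per-field string similarities; a candidate record pair is
# flagged as a duplicate when its weighted similarity crosses a threshold.
import random
from difflib import SequenceMatcher

FIELDS = ["name", "address", "city"]   # assumed record schema

def field_sim(a, b):
    # String similarity in [0, 1] for one pair of field values.
    return SequenceMatcher(None, a, b).ratio()

def pair_score(r1, r2, weights):
    # Weighted combination of the per-field similarities.
    return sum(w * field_sim(r1[f], r2[f]) for w, f in zip(weights, FIELDS))

def fitness(weights, labelled_pairs, threshold=0.5):
    # Fraction of labelled pairs classified correctly at the threshold.
    hits = sum((pair_score(r1, r2, weights) >= threshold) == is_dup
               for (r1, r2), is_dup in labelled_pairs)
    return hits / len(labelled_pairs)

def try_neighbour(i, foods, fits, trials, labelled_pairs):
    # Perturb one dimension of food source i relative to a random peer,
    # keeping the change only if it improves fitness (greedy selection).
    k, j = random.randrange(len(foods)), random.randrange(len(FIELDS))
    cand = foods[i][:]
    step = random.uniform(-1, 1) * (foods[i][j] - foods[k][j])
    cand[j] = min(max(cand[j] + step, 0.0), 1.0)
    cf = fitness(cand, labelled_pairs)
    if cf > fits[i]:
        foods[i], fits[i], trials[i] = cand, cf, 0
    else:
        trials[i] += 1

def abc_search(labelled_pairs, n_food=10, limit=5, cycles=50):
    # Food sources are candidate weight vectors in [0, 1]^len(FIELDS).
    foods = [[random.random() for _ in FIELDS] for _ in range(n_food)]
    fits = [fitness(f, labelled_pairs) for f in foods]
    trials = [0] * n_food
    for _ in range(cycles):
        for i in range(n_food):                    # employed bee phase
            try_neighbour(i, foods, fits, trials, labelled_pairs)
        probs = [f + 1e-9 for f in fits]           # onlooker bee phase
        for _ in range(n_food):
            i = random.choices(range(n_food), weights=probs)[0]
            try_neighbour(i, foods, fits, trials, labelled_pairs)
        for i in range(n_food):                    # scout bee phase
            if trials[i] > limit:
                foods[i] = [random.random() for _ in FIELDS]
                fits[i] = fitness(foods[i], labelled_pairs)
                trials[i] = 0
    return max(zip(fits, foods))[1]                # best weight vector

Given a small labelled sample of duplicate and non-duplicate pairs, abc_search returns a weight vector; candidate pairs whose pair_score under those weights exceeds the threshold would then be flagged and eliminated as duplicates, mirroring the similarity-based elimination step described in the abstract.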

Keywords – ABC Algorithm, Database Integration, Data Set

References

  1. Elmagarmid, A.K., P.G. Ipeirotis and V.S. Verykios, 2007. Duplicate record detection: A survey. IEEE Trans. Knowl. Data Eng., 19: 1-16. DOI: 10.1109/TKDE.2007.250581
  2. Bhagwat, D., K. Eshghi, D.D. Long and M. Lillibridge, 2009. Extreme binning: Scalable, parallel deduplication for chunk-based file backup. Proceedings of the 17th IEEE/ACM International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems, (MASCOTS '09), London, UK.
  3. Bolosky, W.J., S. Corbin, D. Goebel and J.R. Douceur, 2000. Single instance storage in Windows® 2000. Proceedings of the 4th Conference on USENIX Windows Systems Symposium, (WSS '00), USENIX Association, Berkeley, CA, USA, pp: 2-2.
  4. Douceur, J.R., A. Adya, W.J. Bolosky, D. Simon and M. Theimer, 2002. Reclaiming space from duplicate files in a serverless distributed file system. Proceedings of the 22nd International Conference on Distributed Computing Systems, (ICDCS '02), ACM, USA, pp: 617-617.
  5. Dubnicki, C., L. Gryz, L. Heldt, M. Kaczmarczyk and W. Kilian et al., 2009. HYDRAstor: A scalable secondary storage. Proceedings of the 7th Conference on File and Storage Technologies, (FAST '09), pp: 197-210.
  6. Meyer, D.T. and W.J. Bolosky, 2011. A study of practical deduplication. ACM Trans. Storage. DOI: 10.1145/2078861.2078864
  7. Ektefa, M., F. Sidi, H. Ibrahim, M.A. Jabar and S. Memar et al., 2011. A threshold-based similarity measure for duplicate detection. Proceedings of the IEEE Conference on Open Systems, Sept. 25-28, IEEE Xplore Press, Langkawi, pp: 37-41. DOI: 10.1109/ICOS.2011.6079233
  8. Elhadi, M. and A. Al-Tobi, 2009. Duplicate detection in documents and webpages using improved longest common subsequence and documents syntactical structures. Proceedings of the 4th International Conference on Computer Sciences and Convergence Information Technology, Nov. 24-26, IEEE Xplore Press, Seoul, pp: 679-684. DOI: 10.1109/ICCIT.2009.235
  9. Gunawi, H.S., N. Agrawal, A.C. Arpaci-Dusseau, R.H. Arpaci-Dusseau and J. Schindler, 2005. Deconstructing commodity storage clusters. Proceedings of the 32nd Annual International Symposium on Computer Architecture, Jun. 4-8, IEEE Xplore Press, pp: 60-71. DOI: 10.1109/ISCA.2005.20
  10. Haidarian, S., C. Shahri, B. Lucas and N. Araabi, 2006. Identifying duplicate records by using estimation of distribution algorithms to learn the semantics. Proceedings of the 11th International CSI Computer Conference (CSICC '06), Tehran.
  11. Harnik, D., B. Pinkas and A. Shulman-Peleg, 2010. Side channels in cloud services: Deduplication in cloud storage. IEEE Security Privacy, 8: 40-47. DOI: 10.1109/MSP.2010.187
  12. Karaboga, D. and C. Ozturk, 2010. Fuzzy clustering with artificial bee colony algorithm. Sci. Res. Essays, 5: 1899-1902.
  13. Kumar, J.P. and P. Govindarajulu, 2009. Duplicate and near duplicate documents detection: A review. Eur. J. Sci. Res., 32: 514-527.
  14. Kumbhar, P.Y. and P.S. Krishnan, 2011. Use of Artificial Bee Colony (ABC) algorithm in artificial neural network synthesis. Int. J. Adv. Eng. Sci. Technol., 11: 162-171.
  15. Lillibridge, M., K. Eshghi, D. Bhagwat, V. Deolalikar and G. Trezise et al., 2009. Sparse indexing: Large scale, inline deduplication using sampling and locality. Proceedings of the 7th USENIX Conference on File and Storage Technologies, (FAST '09), USENIX Association, pp: 111-123.
  16. Carvalho, M.G.D., A.H.F. Laender, M.A. Goncalves and A.S.D. Silva, 2011. A genetic programming approach to record deduplication. IEEE Trans. Knowl. Data Eng., 24: 399-412. DOI: 10.1109/TKDE.2010.234
  17. Muthitacharoen, A., B. Chen and D. Mazieres, 2001. A low-bandwidth network file system. Proceedings of the 18th ACM Symposium on Operating Systems Principles, Oct. 21-24, ACM Press, Banff, Canada, pp: 174-187. DOI: 10.1145/502034.502052.
  18. Qingwei, Y., W. Dongxing, Z. Yu and W. Xiaodong, 2010. The duplicated of partial content detection based on PSO. Proceedings of the IEEE 5th International Conference on Bio-Inspired Computing: Theories and Applications, Sept. 23-26, IEEE Xplore Press, Changsha, pp: 350-353. DOI: 10.1109/BICTA.2010.5645302
  19. Quinlan, S. and S. Dorward, 2002. Venti: A new approach to archival storage. Bell Labs, Lucent Technologies.
  20. Rhea, S., R. Cox and A. Pesterev, 2008. Fast, inexpensive content-addressed storage in Foundation. Proceedings of the USENIX Annual Technical Conference, (ATC '08), ACM Press, USA, pp: 143-156.
  21. Samanta, S. and S. Chakraborty, 2011. Parametric optimization of some non-traditional machining processes using artificial bee colony algorithm. Eng. Appli. Art. Intell., 24: 946-957. DOI: 10.1016/j.engappai.2011.03.009
  22. Tan, Y., H. Jiang, D. Feng, L. Tian and Z. Yan et al., 2010. SAM: A semantic-aware multi-tiered source deduplication framework for cloud backup. Proceedings of the 39th International Conference on Parallel Processing, Sept. 13-16, IEEE Xplore Press, San Diego, CA, pp: 614-623. DOI: 10.1109/ICPP.2010.69
  23. Ungureanu, C., B. Atkin, A. Aranya, S. Gokhale and S. Rago et al., 2010. HydraFS: A high-throughput file system for the HYDRAstor content-addressable storage system. Proceedings of the 8th USENIX Conference on File and Storage Technologies, (FAST '10), USENIX Association, Berkeley, CA, USA, pp: 17-17.
  24. Vrable, M., S. Savage and G.M. Voelker, 2009. Cumulus: File system backup to the cloud. ACM Trans. Storage. DOI: 10.1145/1629080.1629084
  25. Winkler, W.E., 2001. Record linkage software and methods for merging administrative lists. The Pennsylvania State University.
  26. Zhu, B., K. Li and H. Patterson, 2008. Avoiding the disk bottleneck in the Data Domain deduplication file system. Proceedings of the 6th USENIX Conference on File and Storage Technologies, (FAST '08), USENIX Association, Berkeley, USA.