IJMTES – ON TRAFFIC-AWARE PARTITION AND AGGREGATION IN MAPREDUCE FOR BIG DATA APPLICATIONS

Journal Title : International Journal of Modern Trends in Engineering and Science

Author’s Name : S.Deepa | K.Dhanusha | K.Revathi

Volume 03 Issue 07 2016

ISSN no: 2348-3121

Page no: 27-31

Abstract – The MapReduce programming model simplifies large-scale data processing on commodity clusters by exploiting parallel map tasks and reduce tasks. Although many efforts have been made to improve the performance of MapReduce jobs, they ignore the network traffic generated in the shuffle phase, which plays a critical role in performance enhancement. Traditionally, a hash function is used to partition intermediate data among reduce tasks, which, however, is not traffic-efficient because the network topology and the data size associated with each key are not taken into consideration. In this paper, we study how to reduce the network traffic cost of a MapReduce job by designing a novel intermediate data partition scheme. Furthermore, we jointly consider the aggregator placement problem, where each aggregator can reduce the merged traffic from multiple map tasks. A decomposition-based distributed algorithm is proposed to deal with the large-scale optimization problem for big data applications, and an online algorithm is also designed to adjust data partition and aggregation in a dynamic manner. Finally, extensive simulation results demonstrate that our proposals can significantly reduce the network traffic cost under both offline and online cases.

Keywords— MapReduce, Network Cost, Partition, Online Algorithm, Hash Function
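
To make the idea in the abstract concrete, the following Python sketch contrasts a size- and topology-oblivious hash partition with a greedy, traffic-aware assignment of intermediate keys to reduce tasks. It is only an illustrative toy under stated assumptions, not the paper's algorithm: the node names, data sizes, hop costs, and the per-key greedy rule are all assumed for the example.

# Illustrative toy only: a greedy, traffic-aware key-to-reducer assignment
# compared against a size/topology-oblivious hash partition. All node names,
# data sizes, and hop costs below are assumed for the example.

# Bytes of intermediate data emitted for each key at each map node.
size_at_mapper = {
    "k1": {"m1": 10, "m2": 80},
    "k2": {"m1": 60, "m2": 5},
    "k3": {"m1": 40, "m2": 40},
}

# Assumed per-byte transfer cost between a map node and a reduce node.
hop_cost = {
    "m1": {"r1": 1, "r2": 3},
    "m2": {"r1": 3, "r2": 1},
}

reducers = ["r1", "r2"]

def hash_partition(key):
    """Baseline: deterministic hash that ignores sizes and topology."""
    return reducers[sum(ord(c) for c in key) % len(reducers)]

def traffic_aware_partition(key):
    """Greedy: pick the reducer minimizing sum(size * hop cost) for this key."""
    def cost(reducer):
        return sum(size * hop_cost[mapper][reducer]
                   for mapper, size in size_at_mapper[key].items())
    return min(reducers, key=cost)

def total_traffic(assign):
    """Total shuffle traffic cost for a key-to-reducer assignment function."""
    return sum(size * hop_cost[mapper][assign(key)]
               for key, per_mapper in size_at_mapper.items()
               for mapper, size in per_mapper.items())

print("hash partition cost:         ", total_traffic(hash_partition))            # 595
print("traffic-aware partition cost:", total_traffic(traffic_aware_partition))   # 345

Under the same assumptions, aggregator placement could be modeled by adding candidate nodes that merge a key's per-mapper streams before forwarding them to the chosen reducer; jointly optimizing that placement with the partition is what the paper's decomposition-based distributed algorithm and online algorithm address.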
