Journal Title : International Journal of Modern Trends in Engineering and Science
Paper Title : BRINGING SIZE BASED SCHEDULING TO HADOOP
Volume 04 Issue 03 2017
ISSN no: 2348-3121
Page no: 100-102
Abstract – Hadoop is a free, Java-based programming framework that supports the processing of large data sets in a parallel and distributed computing environment. It is a widely used big data processing engine with a simple master-slave setup. Most companies process big data with Hadoop by submitting jobs to the master, which distributes each job across the cluster and runs its map and reduce tasks in sequence. Growing data volumes and competition between service providers have increased the number of jobs submitted to the master. This concurrent job submission makes scheduling on the Hadoop cluster necessary so that the response time of each job remains acceptable. Two main strategies are used to schedule jobs in a cluster: processor sharing (PS) and first-come, first-served (FCFS). The first splits the cluster resources equally among all running jobs; in Hadoop this strategy is implemented by the Hadoop Fair Scheduler. The second serves one job at a time, thus avoiding resource splitting. Size-based scheduling requires a priori job size information, which is not available in Hadoop; HFSP (the Hadoop Fair Sojourn Protocol) builds this knowledge by estimating job sizes online during job execution. We present the design of a new scheduling protocol that caters to both fair and efficient utilization of cluster resources while striving to achieve short response times. We also use two further scheduling processes: resource sharing and Quad Scheduling. Our solution implements a size-based, preemptive scheduling discipline. The scheduler allocates cluster resources such that job size information is inferred while the job makes progress toward its completion. Scheduling decisions use the concept of virtual time, and cluster resources are focused on jobs according to their priority, computed through aging. This ensures that neither small nor large jobs suffer from starvation.
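To make the difference between the strategies concrete, the following minimal sketch compares them on a single resource of unit capacity. It is an illustration under simplifying assumptions, not the HFSP implementation: all jobs arrive at time zero, job sizes are known exactly, and the function names are hypothetical. The third function serves jobs one at a time in the order in which they would finish under processor sharing, which is one simple reading of the virtual-time idea described above.

```python
def fcfs_response_times(sizes):
    """Serve jobs one at a time in submission order (all arrive at t=0)."""
    t = 0.0
    out = []
    for s in sizes:
        t += s
        out.append(t)
    return out


def ps_response_times(sizes):
    """Split one unit of capacity equally among all unfinished jobs."""
    order = sorted(range(len(sizes)), key=lambda i: sizes[i])
    out = [0.0] * len(sizes)
    t = 0.0
    served = 0.0            # work each unfinished job has received so far
    active = len(sizes)     # number of unfinished jobs
    for i in order:
        # Each active job progresses at rate 1/active until job i finishes.
        t += (sizes[i] - served) * active
        served = sizes[i]
        out[i] = t
        active -= 1
    return out


def virtual_order_response_times(sizes):
    """Serve jobs one at a time, ordered by their finish time under PS.

    Assumption: the PS schedule is only simulated ("virtual time") and is
    used to rank jobs; the real resource is never split.
    """
    virtual_finish = ps_response_times(sizes)
    order = sorted(range(len(sizes)), key=lambda i: (virtual_finish[i], i))
    out = [0.0] * len(sizes)
    t = 0.0
    for i in order:
        t += sizes[i]
        out[i] = t
    return out


def mean(values):
    return sum(values) / len(values)
```

For one large job followed by two small ones, `sizes = [10, 1, 1]`, the mean response time is 11 under FCFS, 6 under PS, and 5 when jobs are served one at a time in virtual finish order, while the large job completes at time 12 in all three cases.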
The Shortest Remaining Processing Time (SRPT) policy, which prioritizes the jobs that need the least amount of work to complete, minimizes the mean response time (or sojourn time), that is, the time between a job's submission and its completion. We extend HFSP to pause jobs with a larger remaining processing time and to admit other waiting jobs from the queue in FCFS order.
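The SRPT policy can be sketched as a small event-driven simulation. This is a minimal illustration assuming a single resource of unit capacity and exactly known job sizes; in Hadoop the sizes would have to be estimated, and the function name and job representation here are hypothetical.

```python
import heapq


def srpt_response_times(jobs):
    """Simulate preemptive SRPT on one resource.

    jobs: list of (arrival_time, size) tuples.
    Returns the response time (completion minus arrival) of each job.
    """
    by_arrival = sorted(range(len(jobs)), key=lambda i: jobs[i][0])
    ready = []                      # min-heap of (remaining work, job index)
    out = [0.0] * len(jobs)
    t = 0.0
    k = 0                           # next job in by_arrival not yet admitted
    while k < len(by_arrival) or ready:
        if not ready:
            # Idle until the next job arrives.
            t = max(t, jobs[by_arrival[k]][0])
        # Admit every job that has arrived by time t.
        while k < len(by_arrival) and jobs[by_arrival[k]][0] <= t:
            i = by_arrival[k]
            heapq.heappush(ready, (jobs[i][1], i))
            k += 1
        remaining, i = heapq.heappop(ready)
        next_arrival = jobs[by_arrival[k]][0] if k < len(by_arrival) else float("inf")
        # Run the job with the least remaining work until it finishes
        # or until a new arrival might preempt it.
        run = min(remaining, next_arrival - t)
        t += run
        if run < remaining:
            heapq.heappush(ready, (remaining - run, i))
        else:
            out[i] = t - jobs[i][0]
    return out
```

With `jobs = [(0, 10), (1, 2), (2, 1)]`, the large job is preempted when the smaller ones arrive, giving response times of 13, 2, and 2. Serving the same jobs in FCFS order would give 10, 11, and 11.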