Abstract:
Over the past few decades, there has been a multifold increase in the amount of
digital data being generated. Various attempts are being made to process this
vast amount of data quickly and efficiently. Hadoop MapReduce is one such
software framework that has gained popularity in the last few years. It
provides a reliable and easier way to process huge amounts of data in parallel
on large computing clusters. However, Hadoop always persists intermediate
results to the local disk. As a result, Hadoop usually suffers from long
execution runtimes, as it typically pays a high I/O cost for running jobs.
State-of-the-art computing clusters have enough main memory capacity to hold
terabytes of data. We have built the M3R (Main Memory MapReduce) framework, a
prototype for generic main-memory-based data processing. In addition to
MapReduce jobs, M3R can execute general data processing jobs.
This master's thesis focuses in particular on countering the data-skewness
problem for MapReduce jobs on M3R. Intermediate data that follows a skewed
distribution can lead to computational imbalance among the reduce tasks,
resulting in longer MapReduce job execution times. This provides scope for
rebalancing the intermediate data and thereby reducing the total job runtimes.
We propose a novel dynamic approach to data rebalancing to counter reducer-side
data skewness. Our proposed on-the-fly skew-countering approach attempts to
detect the level of skewness in the intermediate data and rebalances that data
among the reduce tasks. The proposed mechanism performs all skew-countering
activities during the execution of the actual MapReduce job. We have
implemented this reduce-side skew-countering mechanism as a part
of the M3R framework. The experiments conducted to study the behavior of this
M3R data-rebalancing approach show a significant reduction in MapReduce job
runtimes. For data-skewed input, our proposed skew-control approach for M3R
reduced the total MapReduce job runtime by up to 31% compared to M3R without
skew control.
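
The reducer-side skew problem described above can be illustrated with a small, self-contained sketch. It is not the M3R mechanism: the abstract does not specify how skew is detected or how data is moved, so the greedy largest-first assignment used here is a swapped-in textbook heuristic, and the function names (`hash_partition`, `greedy_rebalance`, `skew_ratio`) and the key counts are hypothetical. The sketch also works offline on known counts, whereas the thesis describes rebalancing on the fly during job execution.

```python
import heapq


def hash_partition(key_counts, num_reducers):
    """Per-reducer load when each integer key goes to reducer (key mod R)."""
    loads = [0] * num_reducers
    for key, count in key_counts.items():
        loads[key % num_reducers] += count
    return loads


def greedy_rebalance(key_counts, num_reducers):
    """Assign key groups, largest first, to the currently least loaded reducer.

    Illustrative heuristic only (an assumption, not the thesis's algorithm).
    """
    heap = [(0, r) for r in range(num_reducers)]  # (current load, reducer id)
    heapq.heapify(heap)
    loads = [0] * num_reducers
    for _key, count in sorted(key_counts.items(), key=lambda kv: -kv[1]):
        load, r = heapq.heappop(heap)
        loads[r] = load + count
        heapq.heappush(heap, (loads[r], r))
    return loads


def skew_ratio(loads):
    """Heaviest reducer load divided by the mean load (1.0 means balanced)."""
    return max(loads) / (sum(loads) / len(loads))


# Hypothetical intermediate data: key -> number of values for that key.
# Keys 0, 4 and 8 are "hot" and all map to reducer 0 under key mod 4.
key_counts = {0: 1000, 4: 800, 8: 600, 1: 50, 2: 40, 3: 30, 5: 20, 6: 10}

before = hash_partition(key_counts, 4)
after = greedy_rebalance(key_counts, 4)
print(before, round(skew_ratio(before), 2))
print(after, round(skew_ratio(after), 2))
```

In this toy input the heaviest reduce task drops from 2400 to 1000 records. Because all values of one key must reach the same reduce call, a single hot key still bounds how balanced the result can be.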