Input Data Ingestion and Shuffle Phase in Hadoop MapReduce
Verified Concept Article • Factual Traceability Enabled
Summary OverviewThe input ingestion pipeline transforms HDFS files into mapper key‑value pairs, which are then shuffled, sorted, and merged across the network before reaching reducers.
Input Ingestion and Logical Splitting
The MapReduce job begins with client‑uploaded files stored in the Hadoop Distributed File System (HDFS). An InputFormat component interprets these files and defines logical input splits, each representing a contiguous block of data that a single mapper will process. The size of an input split typically matches the HDFS block size, and the total number of mappers is estimated by dividing the overall data volume by the split size (Source 2). This logical division enables parallelism while keeping data locality high, as each mapper is scheduled on a node that holds the corresponding HDFS block.
RecordReader: From Bytes to Key‑Value Pairs
For every input split, the RecordReader reads raw bytes and converts them into framework‑compatible key‑value pairs. The key often denotes the byte offset or line number, while the value holds the actual record content. This conversion abstracts file formats (text, binary, sequence files) and presents a uniform interface to the mapper logic (Source 2).
Mapper Phase: Generating Intermediate Data
The Mapper receives the key‑value pairs produced by the RecordReader and applies user‑defined business logic. It may emit zero, one, or many intermediate key‑value pairs per input record. The number of mapper tasks is therefore driven by the input split count, not by the volume of output they generate (Source 2). Intermediate records are temporarily buffered in memory before being handed off to the next stage.
Combiner and Partitioner: Local Aggregation and Key Distribution
Before data leaves the mapper node, an optional Combiner can perform a local reduction, aggregating values that share the same intermediate key. This step reduces the amount of data transferred across the network, alleviating shuffle traffic (Source 2). Simultaneously, the Partitioner determines the target reducer for each key by mapping the key space to a fixed number of partitions. All records with identical keys are guaranteed to be routed to the same reducer, ensuring correct aggregation.
Shuffle and Sort: Network‑Level Data Movement
The shuffle is the physical transfer of intermediate output from mappers to reducers over the cluster network. As reducers pull data from multiple mapper nodes, values associated with the same key from different mappers are merged. The framework then sorts these merged values by key, presenting each reducer with a sorted stream of key‑value groups (Source 1). This sorting facilitates efficient sequential processing and enables reducers to assume that all values for a given key are contiguous.
Reducer and Output Commit
Each Reducer processes the sorted key groups, performing the final aggregation or transformation defined by the user. After producing its output, the reducer delegates writing to a RecordWriter, which formats the results according to the job’s OutputFormat (e.g., text, Avro, Parquet). The RecordWriter selects the appropriate HDFS output directory, creates output splits, and commits the final files to HDFS (Source 1).
Through this tightly coupled ingestion‑to‑shuffle pipeline, Hadoop MapReduce achieves scalable, fault‑tolerant processing of massive datasets while minimizing network overhead and preserving data locality.
Related Topics
Incoming Backlinks
Other pages in this wiki that link back to the current topic.