Word Count Problem Workflow in MapReduce
Summary Overview
The Word Count problem in MapReduce processes input text by mapping each token to a <word, 1> pair and reducing these pairs through summation to produce total word frequencies.
Detailed Explanation
The Word Count problem is a canonical example used to illustrate the MapReduce programming model within the Hadoop ecosystem. In this workflow, the input data set (typically a large text file such as ex1.txt) is partitioned into splits that are processed in parallel by mapper tasks. Each mapper reads its split one line at a time; for every line, the map function tokenizes the text into individual words using a string tokenizer and emits a key-value pair for each token, where the key is the word itself and the value is the integer literal 1. These intermediate pairs are serialized as Hadoop Writable types, specifically Text for the word and IntWritable for the count, as indicated by the imported classes in the reference code.
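The map step described above can be illustrated without a cluster. The following is a minimal plain-Java sketch, not the article's reference code: the class name MapStepSketch is hypothetical, and ordinary String and Integer values stand in for Hadoop's Text and IntWritable.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;

// Plain-Java stand-in for the map step; no Hadoop classes are used here.
final class MapStepSketch {

    // Emits one (word, 1) pair per whitespace-delimited token in a single line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        StringTokenizer tokens = new StringTokenizer(line);
        while (tokens.hasMoreTokens()) {
            pairs.add(new SimpleEntry<>(tokens.nextToken(), 1));
        }
        return pairs;
    }

    public static void main(String[] args) {
        // One call corresponds to one invocation of the map function on one input line.
        for (Map.Entry<String, Integer> pair : map("Hadoop enables scalable data processing")) {
            System.out.println("<" + pair.getKey() + ", " + pair.getValue() + ">");
        }
    }
}
```

StringTokenizer with its default delimiters splits on whitespace only, so punctuation stays attached to words and "Hadoop" and "hadoop" count as different keys.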
After the mapping phase, the Hadoop runtime performs a shuffle and sort operation that groups all values associated with the same key across the distributed mappers. The reducer receives a key together with an iterable collection of its associated values, represented conceptually as <word, [1, 1, 1, …]> (every value is 1 unless a combiner has already pre-summed some of them). The reducer's responsibility is to aggregate these values, which in the Word Count case means summing the integers to compute the total occurrences of each distinct word. Because summation is associative and commutative, partial sums can be computed in any order and grouping, so the same logic can also run as a combiner on each mapper's output without changing the totals. The final output of the reducer is a key-value pair <word, totalCount> written to the designated output directory via FileOutputFormat.
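The grouping and summation can be sketched in plain Java as follows. This is an in-memory illustration under stated assumptions: ShuffleReduceSketch and its methods are hypothetical names, and a TreeMap stands in for the distributed shuffle and sort that Hadoop performs across machines.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory stand-in for shuffle/sort and reduce; no Hadoop classes are used here.
final class ShuffleReduceSketch {

    // Groups values by key. TreeMap keeps keys sorted, mirroring the sort step.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Sums all values that were grouped under one key.
    static int reduce(Iterable<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        pairs.add(new SimpleEntry<>("Hadoop", 1));
        pairs.add(new SimpleEntry<>("enables", 1));
        pairs.add(new SimpleEntry<>("Hadoop", 1));

        for (Map.Entry<String, List<Integer>> group : shuffle(pairs).entrySet()) {
            System.out.println("<" + group.getKey() + ", " + reduce(group.getValue()) + ">");
        }

        // Summing partial sums gives the same total as summing the raw values,
        // which is the property that makes a combiner safe for this job.
        int direct = reduce(Arrays.asList(1, 1, 1));
        int viaPartials = reduce(Arrays.asList(reduce(Arrays.asList(1, 1)), reduce(Arrays.asList(1))));
        System.out.println(direct == viaPartials);
    }
}
```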
The overall MapReduce job is configured through a Job object that specifies the mapper and reducer classes, the input and output formats, the output key and value types, and the paths for the source file and result directory. When the job is submitted, the framework schedules mapper and reducer tasks in containers across the cluster and handles fault tolerance by re-executing failed tasks. Because the Word Count map and reduce functions are deterministic, a re-executed task produces the same output as the original attempt.
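A job wired up this way might look like the sketch below. It is not the article's reference code; it follows the layout of the standard Apache Hadoop WordCount tutorial and assumes the Hadoop client libraries are on the classpath. The class names WordCount, TokenizerMapper, and IntSumReducer, and the use of args[0] and args[1] for the input file and output directory, are illustrative choices.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Called once per input line; emits <word, 1> for every token.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Called once per distinct word; sums the grouped counts.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        // Optional: reuse the reducer as a combiner, which is safe because summation
        // is associative and commutative.
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. ex1.txt
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // must not already exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

No input or output format is set explicitly here, so the job falls back on Hadoop's defaults (TextInputFormat and TextOutputFormat), which is what makes each map call receive one line of text.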
Examples
Consider the line "Hadoop enables scalable data processing" contained in ex1.txt. The mapper tokenizes this line into the tokens "Hadoop", "enables", "scalable", "data", and "processing". For each token it emits a pair such as <Hadoop, 1>, <enables, 1>, and so forth. Suppose another mapper processes a different line, "Hadoop provides fault tolerance", and emits <Hadoop, 1> again along with other word pairs. During the shuffle phase, all <Hadoop, 1> pairs from both mappers are grouped, yielding an input to the reducer of <Hadoop, [1, 1]>. The reducer sums the values to produce <Hadoop, 2>, indicating that the word "Hadoop" appears twice in the combined input. This same process repeats for each distinct word, ultimately generating a complete frequency table of the corpus. The Java source excerpt demonstrates the required imports (java.io.IOException, java.util.StringTokenizer, Hadoop configuration and I/O classes, and the MapReduce API), providing the scaffolding necessary to implement the mapper and reducer logic described above.
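The arithmetic in this example can be checked with a few lines of plain Java. The sketch below is only a way to verify the counts: it folds the shuffle and reduce steps into a single in-memory merge, so it does not model distributed execution, and the class name FrequencyTableSketch is hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Single-process check of the worked example; no Hadoop classes are used here.
final class FrequencyTableSketch {

    // Tokenizes every line and adds 1 to the running total of each word.
    static Map<String, Integer> count(List<String> lines) {
        Map<String, Integer> totals = new TreeMap<>();
        for (String line : lines) {
            StringTokenizer tokens = new StringTokenizer(line);
            while (tokens.hasMoreTokens()) {
                totals.merge(tokens.nextToken(), 1, Integer::sum);
            }
        }
        return totals;
    }

    public static void main(String[] args) {
        System.out.println(count(Arrays.asList(
                "Hadoop enables scalable data processing",
                "Hadoop provides fault tolerance")));
        // prints {Hadoop=2, data=1, enables=1, fault=1, processing=1, provides=1, scalable=1, tolerance=1}
    }
}
```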