MapReduce is a programming model for parallel programming, and originally Google's proprietary execution engine for that model (open-source engines such as Hadoop, and open-source successors such as Beam, also exist). It is loosely based on the map() and reduce() higher-order functions from functional programming, but refers to a very specific algorithm that uses those two higher-order functions exactly once each.
A MapReduce program has three phases. The first is a single execution of map() with a user-provided mapper function, run in parallel across a large data set (which might not even fit on a single computer); the mapper function returns a partition key for each element, which determines which computer will run the rest of the computation for that element. The second phase is a shuffle step, which forwards elements with the same partition key to the same computer. The third phase is multiple executions of reduce() with a user-provided reducer function, one reduce() call per unique partition key returned by the mapper. A serial implementation of MapReduce would always look something like:
import itertools
import operator
from functools import reduce  # in Python 3, reduce lives in functools

def map_reduce(mapper_function, reducer_function, reducer_init, input_list):
    partition_keys = map(mapper_function, input_list)
    kvs = zip(partition_keys, input_list)
    # Shuffle: sort by partition key so groupby() sees equal keys adjacently.
    shuffled_kvs = sorted(kvs, key=operator.itemgetter(0))
    for key, group in itertools.groupby(shuffled_kvs, key=operator.itemgetter(0)):
        values = (value for _, value in group)
        yield key, reduce(reducer_function, values, reducer_init)
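As a quick usage example, here is word count written in this style: the mapper's partition key is the word itself, and the reducer simply counts how many elements landed in its partition. The serial helper is restated compactly so the snippet runs on its own.

```python
import itertools
import operator
from functools import reduce

def map_reduce(mapper_function, reducer_function, reducer_init, input_list):
    # Serial sketch: map, shuffle (sort + group by key), reduce per key.
    kvs = sorted(((mapper_function(x), x) for x in input_list),
                 key=operator.itemgetter(0))
    for key, group in itertools.groupby(kvs, key=operator.itemgetter(0)):
        yield key, reduce(reducer_function, (v for _, v in group), reducer_init)

words = "the quick brown fox jumps over the lazy dog the end".split()
counts = dict(map_reduce(lambda word: word,           # mapper: key is the word
                         lambda acc, _word: acc + 1,  # reducer: count elements
                         0, words))
print(counts["the"])  # → 3
```

Note that the mapper and reducer carry all the application logic; everything else is the fixed MapReduce skeleton.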
The advantage of always structuring your programs this way is that, rather than running serially and being limited to the size of a single computer like the pseudocode above, the execution engine can execute all three phases in parallel across multiple machines. And because both the mapper function and the reducer function are pure functions, if a computer crashes or fails to return results fast enough, the execution engine knows exactly which pieces of work can be resent to a second computer to complete the overall result. This gives fault tolerance without the application programmer needing to design it explicitly.
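To make the parallelism concrete, here is a toy single-machine sketch (the name parallel_map_reduce and the use of a thread pool are illustrative assumptions, not how any real engine works): because the mapper is pure, its calls can be fanned out to workers in any order, and because the reducer is pure, each partition key's reduce() is an independent task. A real engine would distribute these tasks across machines and reschedule any that fail, which a thread pool only gestures at.

```python
import itertools
import operator
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

def parallel_map_reduce(mapper_function, reducer_function, reducer_init, input_list):
    # Hypothetical sketch: threads stand in for a cluster of worker machines.
    with ThreadPoolExecutor() as pool:
        # Map phase in parallel: pure mapper calls are independent tasks.
        partition_keys = list(pool.map(mapper_function, input_list))
        # Shuffle: group elements by their partition key.
        kvs = sorted(zip(partition_keys, input_list), key=operator.itemgetter(0))
        groups = [(k, [v for _, v in g])
                  for k, g in itertools.groupby(kvs, key=operator.itemgetter(0))]
        # Reduce phase in parallel: one independent reduce() task per key.
        results = pool.map(
            lambda kg: (kg[0], reduce(reducer_function, kg[1], reducer_init)),
            groups)
        return dict(results)

# Sum the numbers 0..9, partitioned by parity (key 0 = even, key 1 = odd).
totals = parallel_map_reduce(lambda n: n % 2,         # mapper: key is parity
                             lambda acc, n: acc + n,  # reducer: running sum
                             0, range(10))
print(totals)  # → {0: 20, 1: 25}
```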