Understanding MapReduce And Tez

Writing Hadoop MapReduce code in Java is the lowest level way to program against a Hadoop cluster. Hadoop’s libraries do not contain any abstractions, like Spark RDDs or a Hive or Pig-like higher level language. All code must implement the MapReduce paradigm.

This video provides a great introduction to MapReduce. This documentation provides a written explanation and an example.

However, some of our tools, mainly Hive and Pig, run on Tez rather than MapReduce. Some key points about the differences between Tez and Spark (with much credit to this post):

How Tez works (very similar to Spark):

Tez is a DAG (Directed acyclic graph) architecture.

1. Execute the plan but no need to read data from disk.

2. Once ready to do some calculations (similar to actions in spark), get the data from disk and perform all steps and produce output.

Notice the efficiency introduced by not going to disk multiple times. Intermediate results are stored in memory (not written to disks). On top of that there is vectorization (process batch of rows instead of one row at a time). All this adds to efficiencies in query time.

Only one read and one write.

How Mapreduce works:

1. Read data from file –>one disk access

2. Run mappers

3. Write map output –> second disk access

4. Run shuffle and sort –> read map output, third disk access

5. write shuffle and sort –> write sorted data for reducers –> fourth disk access

6. Run reducers which reads sorted data –> fifth disk output

7. Write reducers output –>sixth disk access

HIGH PERFORMANCE COMPUTING

Software

ARC Storage Services

U-M Resources for Researchers

ARC Cloud Services

Other Cloud Services

Resource Management Portal

Leaving U-M