A good MapReduce algorithms reference book!
I’d like to review the book “MapReduce Design Patterns: Building Effective Algorithms and Analytics for Hadoop and Other Systems”, which I have recently read.
This book describes a list of MapReduce algorithms (the authors call them “patterns”) as well as a few ways to build more complex algorithms out of these building blocks. The algorithms are divided into the following groups:
- Summarization – Numerical summarization (sum, min, max, count, avg, median, std deviation); Inverted index summarization; Counting with counters (limited number of counters)
- Filtering – Simple filter; Bloom filter; Top N filter; Distinct filter
- Data organization – Structured to hierarchical; Partitioning; Binning; Total order sorting; Shuffling
- Join patterns – Reduce side join; Replicated join; Composite join; Cartesian product
This book describes how to implement these algorithms on plain Hadoop using plain Java. The authors suggest that this skill is still needed for the 10% of tasks that are too complicated to be covered by frameworks like Pig.
A lot of reviewers have already pointed out that the examples in this book were not tested. There are a number of mistakes (though not serious ones) in the Java listings.
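To give a feel for what such a pattern boils down to, here is a minimal sketch of the numerical summarization pattern (min, max, count per key) in plain Java. It is my own illustration, not a listing from the book, and it deliberately avoids the Hadoop API: the class `NumericalSummarization`, the `Stats` type and the "user,value" input format are all assumptions, and a `TreeMap` stands in for the shuffle phase.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class NumericalSummarization {

    /** Hypothetical per-key summary: minimum, maximum and number of values seen. */
    static final class Stats {
        final long min;
        final long max;
        final long count;

        Stats(long min, long max, long count) {
            this.min = min;
            this.max = max;
            this.count = count;
        }

        /** Merging is associative and commutative, so the same logic could act as combiner and reducer. */
        Stats merge(Stats other) {
            return new Stats(Math.min(min, other.min),
                             Math.max(max, other.max),
                             count + other.count);
        }

        @Override
        public String toString() {
            return "min=" + min + " max=" + max + " count=" + count;
        }
    }

    /** Simulates map (parse a line into key and partial summary) and reduce (merge per key). */
    static Map<String, Stats> summarize(List<String> lines) {
        Map<String, Stats> byKey = new TreeMap<>(); // stands in for the shuffle/sort phase
        for (String line : lines) {
            // "Map" step: assumed input format is "key,value"
            String[] fields = line.split(",");
            String key = fields[0].trim();
            long value = Long.parseLong(fields[1].trim());
            // "Reduce" step: fold the partial summary into the running one for this key
            byKey.merge(key, new Stats(value, value, 1), Stats::merge);
        }
        return byKey;
    }

    public static void main(String[] args) {
        List<String> input = List.of("alice,5", "bob,7", "alice,2", "alice,9");
        summarize(input).forEach((key, stats) -> System.out.println(key + " -> " + stats));
        // prints:
        // alice -> min=2 max=9 count=3
        // bob -> min=7 max=7 count=1
    }
}
```

In a real Hadoop job the parsing would live in a mapper and the merge in a reducer, with the framework doing the grouping; the sketch only shows why a mergeable summary value makes the pattern work.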
I suggest not paying too much attention to these mistakes – do not treat this book as your first Hadoop textbook. Use it as a reference of MapReduce algorithms instead – read the text rather than the source code! It describes all these algorithms in sufficient depth (the Bloom filter description may be an exception) to understand how they operate, even without prior Hadoop/MapReduce knowledge.
This book may also be useful when preparing for interviews at companies that use Big Data solutions (like Google) – it describes the algorithms, and knowledge of algorithms is usually required at such interviews.
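Since the Bloom filter is the one pattern where I felt the text could go deeper, here is a small sketch of the underlying data structure in plain Java. It is my own illustration under stated assumptions, not the book's code and not Hadoop's built-in Bloom filter class: the name `SimpleBloomFilter`, the choice of `String.hashCode()` plus an FNV-1a hash, and the double-hashing scheme for deriving several bit positions are all picked for brevity.

```java
import java.util.BitSet;

public class SimpleBloomFilter {
    private final BitSet bits;
    private final int size;       // number of bits in the filter
    private final int hashCount;  // number of bit positions set per item

    SimpleBloomFilter(int size, int hashCount) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    /** 32-bit FNV-1a over the string's chars; used as the second base hash (an assumption of this sketch). */
    private static int fnv1a(String s) {
        int h = 0x811C9DC5;
        for (int i = 0; i < s.length(); i++) {
            h ^= s.charAt(i);
            h *= 0x01000193;
        }
        return h;
    }

    /** Derives the i-th bit position from two base hashes (double hashing). */
    private int index(String item, int i) {
        int h1 = item.hashCode();
        int h2 = fnv1a(item);
        return Math.floorMod(h1 + i * h2, size);
    }

    /** Records the item by setting hashCount bits. */
    void add(String item) {
        for (int i = 0; i < hashCount; i++) {
            bits.set(index(item, i));
        }
    }

    /** Returns false only if the item was definitely never added; true may be a false positive. */
    boolean mightContain(String item) {
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(index(item, i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        SimpleBloomFilter filter = new SimpleBloomFilter(1024, 3);
        filter.add("hadoop");
        filter.add("pig");
        System.out.println(filter.mightContain("hadoop")); // prints true: added items are always found
        System.out.println(filter.mightContain("hive"));   // usually false, but a false positive is possible
    }
}
```

The point to take away for the filtering pattern is the asymmetry: a negative answer is always reliable, so records can be dropped cheaply on the map side, while a positive answer still needs to tolerate some false matches.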