Tag Archives: algorithms

Book review: Programming Pearls (2nd edition)

A marvelous algorithm collection!

Today I’d like to review “Programming Pearls” (2nd Edition) book. You may think it is seriously out of date, because it was published in 1999, and you will be wrong.

This book has a list of hand picked real problems solved by various developers at times when RAM was counted in kilobytes and megabytes and CPU frequency in megahertz rather than in gigahertz. These algorithms are still very important, because current computer systems are limited by memory access times rather than by CPU frequency. It means that in many cases you may gain a serious performance boost by getting these algorithms out of your grandfather’s trunk 🙂

This book can not replace any of well-known algorithm text books like “Algorithms (4th Edition)” or “Introduction to Algorithms”. It describes particular quite counter-intuitive applications of these algorithms for some problems. And the most important difference of this book from algorithm textbooks – it is easy and interesting to read, so you are unlikely to get asleep while reading this book late night 🙂

This book describes algorithms in the following areas:

  • Numeric values sorting and searching, anagrams
  • Data structures – right choice of a structure for a problem
  • Program testing and verification
  • Performance tuning principles
  • Back of the envelope calculations (very useful for engineering interview – in Google, for example)
  • Algorithm design techniques
  • Code tuning
  • Squeezing space
  • Sorting
  • Searching
  • Heaps
  • Algorithms on strings (this is just an overview, this is an in-depth book in this area: “Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology”)

Here are a few problems discussed in this book:

  • How to sort up to 10 million unique non-negative integers, all of which are less than 107 in 1.25M memory? What if we have only 1M (or less) memory available? What if our integers are not unique, but number of occurrences of each value is limited?
  • Find all sets of anagrams in the given dictionary.
  • You have a file with 4 billion 32-bit integers. Find an integer which is not in the file. How would you do it if you have an ample amount of RAM? What about case when you have only a few hundred bytes of RAM, but you are allowed to write temporary files?
  • Given a vector of floating-point numbers, find a maximum sum in any contiguous subvector of the input. The author starts from a straightforward O(n3) algorithm and goes down to O(n2), O(n*logN) and finally O(n) algorithm.
  • You have a sorted array of 1000 integers. How can you tune a binary search algorithm for this case?
  • Given a very long sequence of bytes, how would you efficiently count a number of one (set) bits?
  • How to compress 75,000 English words into 52K RAM in order to use them in a spellchecker program?
  • How to generate the random text which looks like it was hand-written if you have several input texts to teach your program?

Book review: MapReduce Design Patterns: Building Effective Algorithms and Analytics for Hadoop and Other Systems

A good MapReduce algorithms reference book!

I’d like to review “MapReduce Design Patterns: Building Effective Algorithms and Analytics for Hadoop and Other Systems” book I have recently read.

This book describes a list of MapReduce algorithms (authors call them “patterns”) as well as a few ways to create the more complex algorithms out of these building blocks. Algorithms were divided into the following groups:

  • Summarization – Numerical summarization (sum, min, max, count, avg, median, std deviation); Inverted index summarization; Counting with counters (limited number of counters)
  • Filtering – Simple filter; Bloom filter; Top N filter; Distinct filter
  • Data organization – Structured to hierarchical; Partitioning; Binning; Total order sorting; Shuffling
  • Join patterns – Reduce side join; Replicated join; Composite join; Cartesian product

This book describes how to implement these algorithms on pure Hadoop using pure Java. Authors of this book suggest that such skill still should be required for 10% most complicated tasks not covered by frameworks like Pig.

A lot of reviewers have already pointed out that examples in this book were not tested. There is a number of mistakes (but not serious ones) in Java listings.

I want to suggest not paying to much attention to these mistakes – do not treat this book as your first Hadoop textbook. Use this book as a reference of MapReduce algorithms instead – read text part of a book instead of source codes! It describes all these algorithms in sufficient depth (Bloom filter description may be an exception) in order to understand how they operate even without prior Hadoop/MapReduce knowledge.

This book may also be useful for interview preparation to the companies using BigData solutions (like Google) – it describes the algorithms, and the algorithms knowledge is usually required on such interviews.