Book review: Systems Performance: Enterprise and the Cloud

All you need to know about Linux performance monitoring and tuning.

If you have visited this blog, you are probably interested in Java performance tuning. You want to know why ArrayList usually beats LinkedList, why HashMap will usually outperform TreeMap (but we should still use the latter if we want a sorted map) or why date parsing can be so painfully slow.

Nevertheless, at some point in your career you will have to consider your application's environment: the server hardware, other applications running on your server and other servers running in your network (as well as many other things).

You may, for example, want to know why disk operations were so quick on your development box but became a major issue on the production box. There could be various reasons:

  • Trying to acquire a file lock on NFS too often
  • Another process is using the same disk, legitimately or due to a misconfiguration
  • The operating system is using the same disk for paging
  • Your development box has an SSD installed, but the production box relies on the “ancient” 🙂 HDD technology
  • Or lots of other reasons

Or you may be at the other end of the spectrum, trying to squeeze the last cycles out of a critical code path. In this situation you may want to know which levels of the memory hierarchy your code is accessing (L1-L3 CPU caches, RAM, disks). Java does not give you this information, so you have to use OS monitoring tools to obtain it. Knowing it allows you to modify your algorithm or tune your dataset size so that it fits into the appropriate level of the memory hierarchy.
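To make the memory hierarchy effect tangible, here is a minimal sketch (not taken from the book) that chases pointers through arrays of increasing size and prints the average time per access. The class name `MemoryHierarchyProbe`, its methods and the chosen sizes are all assumptions made for illustration. It is a crude timing probe, not a rigorous benchmark (a harness such as JMH would be more appropriate for that), and it only hints at where the cache boundaries are; the OS tools mentioned above are still needed to see which level is actually being hit.

```java
import java.util.Random;

// Hypothetical illustration class; names and sizes are assumptions.
class MemoryHierarchyProbe {

    // Written after the timed loop so the JIT cannot discard the loop as dead code.
    static int sink;

    // Builds a random single-cycle permutation (Sattolo's algorithm), so that
    // following next[pos] visits every slot in an order that is hard to prefetch.
    public static int[] buildCycle(int slots, long seed) {
        int[] next = new int[slots];
        for (int i = 0; i < slots; i++) {
            next[i] = i;
        }
        Random rnd = new Random(seed);
        for (int i = slots - 1; i > 0; i--) {
            int j = rnd.nextInt(i); // 0 <= j < i
            int tmp = next[i];
            next[i] = next[j];
            next[j] = tmp;
        }
        return next;
    }

    // Returns the average wall-clock nanoseconds per dependent array read
    // for a working set of roughly sizeBytes.
    public static double nanosPerAccess(int sizeBytes, int accesses) {
        int[] next = buildCycle(Math.max(2, sizeBytes / Integer.BYTES), 42L);
        int pos = 0;
        // Warm-up pass: lets the JIT compile the loop and touches the array once.
        for (int i = 0; i < accesses; i++) {
            pos = next[pos];
        }
        long start = System.nanoTime();
        for (int i = 0; i < accesses; i++) {
            pos = next[pos];
        }
        long elapsed = System.nanoTime() - start;
        sink = pos;
        return (double) elapsed / accesses;
    }

    public static void main(String[] args) {
        // Working sets from 16 KiB to 64 MiB, growing 4x per step.
        for (int kib = 16; kib <= 64 * 1024; kib *= 4) {
            System.out.printf("%6d KiB: %.2f ns/access%n",
                    kib, nanosPerAccess(kib * 1024, 5_000_000));
        }
    }
}
```

On typical hardware the time per access tends to rise as the working set outgrows each cache level, but the exact numbers depend on the CPU, the JVM and whatever else is running on the box.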

Or perhaps you are on the cutting edge and want to deploy your brand new application in the cloud. The biggest issue with clouds is that you have to pay for everything: excessive CPU usage (as well as non-excessive 🙂 ), suboptimal IO and high memory consumption (usually via a requirement to pay for larger and more expensive instances). Besides that, your application might be affected by the other tenants of the same physical box. For example, an HDD is a non-interruptible device, so one tenant can mount a temporary denial of service “attack” on the other tenants while still staying within its own quota. What tools and strategies would you use for performance monitoring and tuning in the cloud?

“Systems Performance: Enterprise and the Cloud” by Brendan Gregg is the best reference book I have seen on Linux and Solaris performance monitoring. It is written for system administrators, so it is not tied to any programming language. The book starts with a description of methodologies that can be used for troubleshooting performance issues. The introductory chapters are followed by chapters covering the following system components:

  • CPU
  • Memory
  • File systems
  • Disks
  • Network
  • Cloud computing

Each of these chapters starts with an overview of the given component, followed by a description of the applicable performance tuning methodologies.

The last chapter of this book describes a real-world performance investigation (in my opinion, you should start reading the book from this chapter 🙂 ).

I recommend ordering a paper copy of this book, because it will serve as a handy reference during complex performance investigations.