A client we work with recently had an incident: one of their systems slowed to a crawl, and CPU usage shot up to 100%.
"But how can this be?" asked the client. "We quadrupled the number of CPUs in that box the last time this happened!"
How indeed? What was it doing with all that CPU?
As it happened, the system in question was Java-based, so we could dig deeper. On most modern versions of Java, it's possible - even easy - to attach a profiler to a running Java application. Oracle Java and OpenJDK both come with VisualVM, which lets you see what your app is doing using nothing more than point-and-click. There are similar tools for IBM's J9, as well as paid tools like AppDynamics or dynaTrace.
The profiling data was enlightening. The application was spending most of its time in a "critical section" - that is, a section of code that can only handle one request at a time. Because of this, only one CPU on the machine was doing useful work at any given moment, so all those extra CPUs were no use to anyone.
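To picture the problem, here's a minimal sketch of what such a bottleneck can look like in Java. The class and method names are hypothetical - this is not the client's actual code - but it shows how a single `synchronized` method forces every request through one lock:

```java
// Hypothetical illustration of a critical section: the synchronized
// keyword means only one thread can be inside handle() at a time,
// so only one CPU does useful work no matter how many you add.
public class WsdlHandler {
    public synchronized String handle(String request) {
        // Expensive work done under the lock on every single request.
        String definition = parseWsdl();
        return definition + ":" + request;
    }

    private String parseWsdl() {
        // Stand-in for re-reading and re-parsing a file from disk.
        return "wsdl-definition";
    }
}
```

With a handler like this, throughput stops improving once a single core is saturated: threads queue up on the lock instead of running in parallel.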
Moreover, when we looked at the "critical" code, it turned out that it wasn't even necessary. It was re-reading and re-processing a particular file (a WSDL file, for the techies out there), over and over again. WSDL files never change whilst an application is running, so the fix was obvious: Process the file once, and forget about it.
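A sketch of that kind of fix, using the same hypothetical names as above: since the file's contents never change while the application is running, parse it once, keep the result, and let every request read the shared copy with no lock at all:

```java
// Hypothetical sketch of the fix: parse once, cache forever.
public class CachedWsdlHandler {
    // Parsed a single time when the class is loaded. Because the
    // underlying file never changes at runtime, the immutable result
    // can be safely shared by every thread.
    private static final String WSDL_DEFINITION = parseWsdl();

    public String handle(String request) {
        // No synchronization needed - threads only read shared state.
        return WSDL_DEFINITION + ":" + request;
    }

    private static String parseWsdl() {
        // Stand-in for the expensive read-and-parse, now run only once.
        return "wsdl-definition";
    }
}
```

With the lock and the repeated parsing gone, requests run in parallel and the extra CPUs can finally earn their keep.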
This seemingly minor change produced an 835% improvement in throughput. That's not a typo.
And best of all, the application was scalable again. If they get close to their CPU capacity again, there's a pretty good chance that they can fix it with more CPUs.
So, the moral of the story: if your system is close to its capacity, don't just throw resources at it - find out why it's busy first.