There is a downside to great storage performance!
One might think that great storage performance is simply great news! Given the infamous storage performance gap, what could possibly be wrong with high-performing storage hardware like SSDs?
Let me recount an actual case of a storage performance issue, one that did not involve SSDs but that leads me to anticipate a spate of similar issues in the near future.
Let’s call this the case of Customer C. Over a weekend, critical production jobs began to run extremely slowly. Jobs that normally finished in a couple of hours were running for many, many hours and missing their required deadlines! The issues seemed to be isolated to a couple of database servers, and those servers’ stats showed pretty high read and write response times.
Obviously a STORAGE PERFORMANCE PROBLEM! What’s gone wrong with that storage?
Fortunately, there was a considerable amount of performance information available for the storage subsystem. The configuration consisted of about 80 RAID 5 arrays. One array was running consistently at 99+ % utilization, while all the others were running at modest utilization (see graph). Its read and write response times were pretty high too! Customer C quickly confirmed that the critical databases were indeed on that particular array. What more evidence of a storage problem could be required? Customer C demanded an “action plan”: storage remediation, namely, install SSDs! They are so fast that these response times would never be a problem.
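To make that detective work concrete, here is a minimal sketch, in Python, of the kind of check that flags a single saturated array among dozens of quiet ones. The array names, utilization figures, and thresholds are all invented for illustration; real numbers would come from your storage subsystem’s performance reports.

```python
# Hypothetical sketch: flag saturated RAID arrays from per-array utilization samples.
# Array names and utilization values are invented for illustration only.
from statistics import median

# Average utilization (percent) over the last interval, keyed by array name.
# In practice there would be ~80 entries, not 5.
array_util = {
    "array_01": 12.4, "array_02": 9.8, "array_03": 99.6,
    "array_04": 14.1, "array_05": 11.7,
}

def flag_saturated(util_by_array, hard_limit=90.0, outlier_factor=4.0):
    """Return arrays that are near saturation AND far above the group median."""
    med = median(util_by_array.values())
    return sorted(
        name for name, util in util_by_array.items()
        if util >= hard_limit and util >= outlier_factor * med
    )

print(flag_saturated(array_util))   # -> ['array_03']
```

Comparing against the median of the whole group, rather than using a fixed threshold alone, is what makes one hot array stand out against 79 quiet ones.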
Notwithstanding the speed of SSDs, it is prudent to ask where all that activity on the busy array was coming from. Were the production databases suddenly saturating the array? No, in fact they had rather modest IO rates, with admittedly high response times.
But there were several other extremely busy volumes allocated to that array. They had names like rootvgxxx. Hmm…. The xxx part of the volume names pointed to a couple of unimportant servers. Still a storage problem as far as Customer C was concerned. Nevertheless, someone took a look at those other servers and discovered they were in a tight loop of core dumps, writing constantly to dump logs on the suspect array, hour after hour, for several days. A reset on those (unimportant) systems solved the problem immediately.
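What “someone took a look” amounts to can be sketched just as simply: rank the volumes on the suspect array by sustained write rate and the rogue volumes jump out. The volume names and write rates below are invented for illustration, not Customer C’s actual data.

```python
# Hypothetical sketch: rank the volumes on one busy array by sustained write rate.
# Names and numbers are invented; real data would come from the storage
# subsystem's per-volume performance statistics.
volume_write_kbps = {
    "proddb_data01": 850,
    "proddb_log01": 1200,
    "rootvg_hostA": 48000,   # rogue: constant core-dump logging
    "rootvg_hostB": 52000,   # rogue: constant core-dump logging
    "appvg_misc": 300,
}

top_writers = sorted(volume_write_kbps.items(), key=lambda kv: kv[1], reverse=True)
for name, kbps in top_writers[:3]:
    print(f"{name:>15}: {kbps:>8} KB/s sustained writes")
```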
You might ask how constant core dumps from multiple systems could go unnoticed. Well, there were over 300 systems attached to that storage subsystem. Some were considered important and others not. The production systems were closely monitored, but those “unimportant” systems were not. Obviously, a rigorous process of performance monitoring should have caught this before it became a “storage problem”. I am not so interested in criticizing the monitoring processes as in speculating what the impact of better storage allocation might have been, and how SSDs might have behaved in this set of circumstances.
Actually, it has not been best practice to allocate volumes to one particular RAID array for quite some time. Individual servers are powerful enough to overload a single array of 7 or 8 or 10 disks. It has (for some time) been better practice to use various techniques to spread activity across multiple RAID arrays. In this case, let us imagine how the situation might have played out. The rogue systems doing all that writing would have spread their activity across many more arrays and many more individual disks. The write response times for this activity might still have been elevated, but not absurd. And the production systems would still have been impacted, but not quite so severely, not quite so suddenly and dramatically. It would have looked much more like a gradual degradation in storage performance.
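A bit of back-of-the-envelope arithmetic (with assumed figures, not Customer C’s) shows why spreading the same rogue write load over many arrays turns a sudden outage into a gradual degradation:

```python
# Back-of-the-envelope sketch: the same rogue write load concentrated on one
# RAID array versus striped across many. All numbers are assumptions.
rogue_write_iops = 4000          # assumed sustained write rate from the dumping servers
array_write_capacity_iops = 900  # assumed comfortable write ceiling for one RAID 5 array

for n_arrays in (1, 8, 40):
    per_array = rogue_write_iops / n_arrays
    pct = 100 * per_array / array_write_capacity_iops
    print(f"spread over {n_arrays:>2} arrays: {per_array:7.0f} IOPS/array "
          f"({pct:4.0f}% of assumed per-array capacity)")
```

With everything on one array, the rogue load alone is several times what the array can absorb; spread across 40 arrays it becomes a modest extra load on each, noticeable but survivable.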
And what if SSDs had been in the mix? SSDs are quick! Even writes, which are the weaker part of SSD performance, are quick. This problem might have gone undetected for days and days. So in the end I guess I do have to fault the monitoring process – or the lack thereof.
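Which suggests what the monitoring needs to key on: not just response times, which fast media can keep deceptively low, but sustained write volume. Here is a minimal sketch of such a check, with an invented window size, threshold, and sample data:

```python
# Minimal sketch: alert on sustained heavy writes per volume, independent of
# response time, so fast media cannot hide a rogue writer. Thresholds and
# sample data are invented for illustration.
from collections import deque

WINDOW_SAMPLES = 12       # e.g. twelve 5-minute samples = one hour
WRITE_MB_LIMIT = 500      # assumed "normal" per-sample ceiling for a root volume

def make_write_checker(window=WINDOW_SAMPLES, limit_mb=WRITE_MB_LIMIT):
    history = deque(maxlen=window)
    def check(mb_written_this_sample):
        history.append(mb_written_this_sample)
        # Alert only when the volume exceeds the limit for the entire window.
        return len(history) == window and min(history) > limit_mb
    return check

check_volume = make_write_checker()
alert = False
for sample in [2800] * 12:   # an hour of constant core-dump writes
    alert = check_volume(sample)
print("ALERT: sustained heavy writes" if alert else "ok")
```

An hour of constant dump traffic trips the alert regardless of how quickly the SSDs absorb the writes.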
What about “automatic tiering” and dynamic storage optimization algorithms? They could make it even more difficult to see through this particular set of circumstances. But these circumstances were real, not hypothetical at all. Fancy algorithms and super-fast hardware are no doubt great for optimizing performance under normal conditions. But they might just make it harder to detect abnormal ones.
Storage performance monitoring is a crucial business process, even with SSDs and clever storage management algorithms. The downside of great performance is that it can mask the weaknesses in your performance monitoring and performance management processes.