Some thoughts from the Storage Advisor
I get a lot of calls from people who are interested in maxCache … how does it work, what does it do, and most importantly … will it work for me? So I thought I’d put some ramblings down on what has worked for customers and where I think maxCache could/should be used.
Firstly just a quick summary of maxCache functionality in plain English. You need an Adaptec card with “Q” on the end of it for this to work, and no, you can’t upgrade a card without “Q” to a “Q” card – but you can swap out the drives from an existing controller to a “Q” controller, then plug in SSDs and enable maxCache (bet you didn’t know that one). maxCache is the process of taking SSDs and treating them as read and write cache for a RAID array – that’s a basic statement but it’s pretty close to what happens – add a very large amount of cache to a controller.
So let’s take an existing system that’s running 8 x enterprise SATA in a RAID 5 – pretty common configuration. That might be connected to a 6805 controller in an 8-bay server. You want to make this thing faster for the data that has ended up on this server without reconfiguring the server or rebuilding the software installation. This server started life as just a plain file server, but now has a small database, accounting software, and is running terminal server … a far cry from what it started out as. You want to increase the performance of the random data. maxCache does not affect the performance of streaming data – it only works on small, random, frequently accessed blocks of data.
Upgrade the drivers in your OS (always a good starting point) and make sure the new drivers support the replacement “Q” controller, in this example an 81605ZQ. In most OSes this is standard – we have, for example, one Windows driver that supports all our cards. Then disconnect the 6 series from the drives, plug in and wire up the 81605ZQ and reboot. All should be well. You will see some performance difference as the 8 series is dramatically quicker than the 6 series controller, but the spinning drives will be the limiting factor in this equation.
Once you’ve seen that all is working well, and you’ve updated maxView management software to the latest version etc, then shut the system down, grab a couple of SSDs (let’s for argument’s sake say 2 x 480GB SanDisk Extreme Pro) and fit them in the server somewhere. Even if there are no hot swap bays available there is always somewhere to stick an SSD (figuratively speaking) – they don’t vibrate and don’t get hot so they can be fitted just about anywhere.
Create a RAID 1 out of the 2 x SSDs. Then add that RAID 1 to the maxCache pool (none of which takes very long). When finished enable maxCache read and write cache on your RAID 5 array. Sit back and watch. Don’t get too excited as nothing much seems to happen immediately. In fact maxCache takes a while to get going (how long is a while? … how long is a piece of string?). The way it works is that once enabled, it will watch the blocks of data that are transferring back and forth from the storage to the users and vice versa.
So just like a tennis umpire getting a sore neck in the middle of a court, the controller watches everything that goes past. It then learns as it goes what is small, random and frequent in nature, keeping track of how often blocks of data are read from the array etc. As it sees suitable candidates of data blocks, it puts them in a list. Once the frequency of the blocks hits a threshold, the blocks are copied in the background from the HDD array to the SSDs. This is important – note that it is a “copy” process – not a moving process.
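For those who think better in code, here is a minimal sketch of that "watch, count, copy" idea. It is a toy model only, not how the controller firmware is actually written: the class name `HotBlockTracker`, the `promote_after` threshold and its value of 3 are all made up for illustration, and it only looks at read frequency (it ignores the "small" and "random" parts of the test).

```python
from collections import Counter


class HotBlockTracker:
    """Toy model: count reads per block and copy a block to SSD at a threshold."""

    def __init__(self, promote_after=3):
        # Hypothetical threshold; the real value is not given in this article.
        self.promote_after = promote_after
        self.read_counts = Counter()   # the "list" of candidate blocks
        self.ssd_copies = set()        # blocks that now ALSO live on the SSDs

    def record_read(self, block_id):
        """Note one read; return True if this read triggered a copy to SSD."""
        if block_id in self.ssd_copies:
            return False               # already copied, nothing to learn
        self.read_counts[block_id] += 1
        if self.read_counts[block_id] >= self.promote_after:
            # A copy, not a move: the HDD array still holds the block.
            self.ssd_copies.add(block_id)
            return True
        return False


tracker = HotBlockTracker(promote_after=3)
for block in [7, 7, 42, 7, 42, 99]:
    tracker.record_read(block)

print(sorted(tracker.ssd_copies))  # [7]
```

Block 7 is read three times so it gets copied; blocks 42 and 99 stay on the candidate list until they are read often enough.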
Once that has happened, a copy of the data block lives on the SSDs as well as on the HDD array. Adaptec controllers use a process of “shortest path to data”. When a request comes for a block of data from the user/OS, we look first in the cache on the controller. If it’s there then great, it’s fed straight from the DDR on the controller (quickest possible method). If it’s not there then we look up a table running in the controller memory to see if the data block is living on the SSDs. If so, then we get it from there. Finally, if it can’t be found anywhere we’ll get it from the HDD array, and will take note of the fact that we did (so adding this data block to the learning process going on all the time).
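The lookup order above can be sketched in a few lines. Again this is only an illustration under my own assumptions: plain Python dicts stand in for the controller DDR cache, the SSD lookup table and the HDD array, and the names `read_block` and `miss_log` are hypothetical.

```python
def read_block(block_id, dram_cache, ssd_table, hdd_array, miss_log):
    """Return (data, source), trying the fastest place first."""
    if block_id in dram_cache:          # 1. cache on the controller
        return dram_cache[block_id], "dram"
    if block_id in ssd_table:           # 2. copy living on the SSDs
        return ssd_table[block_id], "ssd"
    miss_log.append(block_id)           # 3. note the miss for the learning process
    return hdd_array[block_id], "hdd"   #    and fetch from the spinning drives


hdd_array = {n: f"data-{n}" for n in range(5)}   # every block lives here
ssd_table = {1: hdd_array[1], 2: hdd_array[2]}   # copies of the hot blocks
dram_cache = {2: hdd_array[2]}                   # most recently used block
miss_log = []

sources = [
    read_block(n, dram_cache, ssd_table, hdd_array, miss_log)[1]
    for n in (2, 1, 4)
]
print(sources)    # ['dram', 'ssd', 'hdd']
print(miss_log)   # [4]
```

Whichever tier answers, the data returned is the same; only the time taken to get it differs.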
Why does this help? Pretty obviously the read speed of the SSD is dramatically faster than the spinning drives in the HDD array, especially when it comes to a small block of data. Now as life goes on and users read and write to the server we are learning all the time, and constantly adding new blocks to the SSD pool. Therefore performance increases over a period of time rather than being a monumental jump immediately.
The SSD write cache side of things comes into play when blocks that live in the SSD pool (remembering these are copies of data from the HDD) are updated. If the block is already in the SSD pool then it’s updated there, and copied across to the HDD as a background process a little later (when the HDD are not so busy).
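A rough sketch of that write path, under stated assumptions: the class `WriteBackPool` and its method names are hypothetical, dicts stand in for the drives, and sending writes for blocks that are not in the SSD pool straight to the HDD is my assumption (the behaviour for those blocks is not covered here).

```python
class WriteBackPool:
    """Toy model of updating a cached block on SSD and flushing it to HDD later."""

    def __init__(self, hdd):
        self.hdd = hdd        # block_id -> data on the spinning drives
        self.ssd = {}         # copies of hot blocks
        self.dirty = set()    # blocks updated on SSD but not yet on HDD

    def promote(self, block_id):
        """Copy a block from HDD to SSD (stands in for the learning process)."""
        self.ssd[block_id] = self.hdd[block_id]

    def write(self, block_id, data):
        if block_id in self.ssd:
            self.ssd[block_id] = data   # update the fast copy first
            self.dirty.add(block_id)    # HDD copy is now stale
        else:
            self.hdd[block_id] = data   # assumption: uncached blocks go to HDD

    def flush(self):
        """Background step: copy updated blocks across to the HDD."""
        flushed = len(self.dirty)
        for block_id in sorted(self.dirty):
            self.hdd[block_id] = self.ssd[block_id]
        self.dirty.clear()
        return flushed


pool = WriteBackPool({1: "old-1", 2: "old-2"})
pool.promote(1)
pool.write(1, "new-1")        # lands on the SSD copy
pool.write(2, "new-2")        # not cached, goes to the HDD
before = pool.hdd[1]
flushed = pool.flush()
after = pool.hdd[1]
print(before, after, flushed)  # old-1 new-1 1
```

The gap between the write and the flush is where the gain comes from: the user sees the fast SSD write while the HDD catches up later.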
End result … your server read and write performance increases over a period of time.
Pitfalls and problems for young players …
All this sounds very easy, and in fact it is, but there are some issues to take note of that require customer education as much as technical ability.
Speed before and after
If you have no way of measuring how fast your data is travelling prior to putting maxCache in the system, then you won’t have any way of measuring later, so you can only go by “feel” … what the users experience when accessing data. While this is a good measure, it’s pretty hard to quantify.
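If you have nothing else to hand, even a crude script gives you a number to compare before and after. The sketch below times small random reads from a file and reports the median; the function name `median_random_read_ms`, the 4 KB block size and the sample count are my own choices, not anything from Adaptec. Be aware that the operating system's own file cache will flatter the result, so point it at a large file on the array that has not been read recently, and treat the figure as a rough guide only.

```python
import os
import random
import statistics
import tempfile
import time


def median_random_read_ms(path, block_size=4096, samples=200, seed=0):
    """Median time in milliseconds for small reads at random offsets in a file."""
    rng = random.Random(seed)                      # fixed seed: same offsets each run
    blocks = os.path.getsize(path) // block_size   # file must hold at least one block
    timings = []
    with open(path, "rb", buffering=0) as f:
        for _ in range(samples):
            f.seek(rng.randrange(blocks) * block_size)
            start = time.perf_counter()
            f.read(block_size)
            timings.append((time.perf_counter() - start) * 1000)
    return statistics.median(timings)


# Demo on a small temporary file; in practice use a big file on the RAID array.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(os.urandom(1024 * 1024))
    demo_path = tmp.name

result_ms = median_random_read_ms(demo_path)
os.remove(demo_path)
print(f"median 4 KB random read: {result_ms:.4f} ms")
```

Run it with the same file and the same seed before the change and again a week or so after, once maxCache has had time to learn.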
Let me share some experiences I had from the early days of showing this to customers. I added maxCache to an existing system for a customer on a trial basis (changing controller to Q, adding SSD etc). Left the customer running for a week feeling quite confident that it would be a good result when I went back. Upon return, the customer indicated that he didn’t think it was much of a difference and wasn’t worth the effort or cost. So I put the system back the way it was before I started (original controller and no SSD) and rebooted. The customer started yelling at me very loudly that I’d stuffed his system … “it was never this slow before!” Truth of the matter was that it was exactly the same as before, so the speed was what he had been living with. Lesson: customers are far less likely to say anything about a computer getting faster, but they yell like stuck pigs as soon as something appears to be “slower”.
Second example was in a terminal server environment. This time we could measure the performance of the server by measuring the logon time of the terminal server screen etc. It was pretty bad (about 1 minute). So we went through the process again and added maxCache. The boss of the organization (who happens to be a good reseller of mine) immediately logged on to TS – and grandly indicated that there was no difference and I didn’t know what I was doing. So we went to the pub. Spent a good night out on the town and went back to the customer in the morning (a little the worse for wear). The boss got to work around 10.00am (as bosses do) and was pretty much the last person to log on to TS that morning. Wow, 6 seconds to log on. We then had the Spanish Inquisition (no-one expects the Spanish Inquisition – https://www.youtube.com/watch?v=7WJXHY2OXGE) as to what we had done that night. The boss was thinking we’d spent all night working on the server instead of working on the right elbow.
In reality, the server had learnt the data blocks involved in the TS logon (which are pretty much the same for all users), so by the time he logged in it was mostly reading from the SSDs, hence a great performance improvement. Lesson: educate the customer as to how it works and what to expect before embarking on the grand installation.
The third and last experience was with performance testing. I’ve already blogged about this, but it bears mentioning here. Customer running openE set up his machine and did a lot of testing (unfortunately in a city far away from me so I could not do hands on demo etc). Lots of testing with iometer did not show much of a performance improvement, but when finally biting the bullet and putting the server into action, the customers were ecstatic. A great performance improvement on Virtual Desktop software. Lesson: spend a lot more time talking to the customer about how the product works so they understand it’s random data that’s at play here, and that performance testing streaming data won’t show any performance improvement whatsoever.
There are a lot of servers out there that would benefit from maxCache to speed up the random data that has found its way onto the server whether intentionally or not. It needs to be kept in mind that servers don’t need rebuilding to add maxCache, and it can be added (and removed) without any great intrusion into a client’s business.
The trick is to talk to the customer, talk to the users and find out what the problems in life are before just jumping in and telling them that this will fix their problems. Then again, you should probably do that anyway before touching anything on a server … but that’s one of life’s lessons that people have to work out for themselves.