A while ago we conducted a study to find the fastest IO subsystem that money can buy these days. The only restriction was that it should be built out of (semi) commodity hardware. That meant that extremely expensive solutions, where a sales representative needs to visit your office in order to give a price indication, were off limits. We reported our findings here: One DVD per second
When we examined the graphs resulting from our previous study, it was pretty obvious that we were hitting a bottleneck somewhere. No matter the amount of hardware that we threw at the system, all graphs quickly saturated at around a 65K IOPS limit. This is most apparent in the following graphs, which we repeat here from the previous study:
fig 1. random IOps v.s. block-size and threads
RAID0 8 SSDs
fig 2. random IOps v.s. block-size and threads
RAID0 12 SSDs
fig 3. random IOps v.s. block-size and threads
2x8xRAID0 (16 SSDs total)
In these examples, maxing out the number of SSDs per controller (figure 2) or doubling the entire storage system and then striping the two halves together (figure 3) does not yield any benefit with respect to the maximum number of random IOPS. We do see that the setup depicted in figure 3 gives us some performance benefits, but basically it only pushes two more data points (64k and 128k block sizes) towards the imaginary 60~65K barrier.
For SSD performance, correct alignment is of the utmost importance. A misaligned SSD will seriously underperform. In our previous study we did our best to find a correct alignment. Since we were using LVM to do the striping, we set the metadata size to 250k. This is actually a little trick we found, and apparently others found out about it too:
LVM likes to allocate 192k for its header information, and 192k is not a multiple of 128k. So if you are creating file systems as logical volumes, and you want those volumes to be properly aligned, you have to tell LVM that it should reserve slightly more space for its meta-data, so that the physical extents that it allocates for its logical volumes are properly aligned. Unfortunately, the way this is done is slightly baroque:
# pvcreate --metadatasize 250k /dev/sdb2
Physical volume "/dev/sdb2" successfully created
Why 250k and not 256k? I can’t tell you; sometimes the LVM tools aren’t terribly intuitive. However, you can test to make sure that physical extents start at the proper offset by using:
When digging some more into this, we learned about the concept “data stripe width”, which is defined as:
data width = array stripe size * number of datadisks in your array
E.g. for raid 6 it’s array stripe size * (number of disks in array -2) and for raid 5 it’s array stripe size * (number of disks in array -1). This is an important number when you are striping multiple smaller arrays (sub-arrays, or what we called ‘legs’) into a larger array. It gives you the amount of data that together occupies exactly 1 array stripe of each device that makes up the sub-array. It’s this number that you should work with when trying to align LVM.
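The definition above can be written out as a small calculation. The following is only a sketch: the function names are made up for illustration, and the RAID 10 case (two-way mirrors, so half the disks hold data) is our assumption, since the text above only spells out RAID 5 and RAID 6.

```python
def data_disks(raid_level: int, total_disks: int) -> int:
    """Number of disks in one sub-array that hold data rather than redundancy."""
    if raid_level == 0:
        return total_disks
    if raid_level == 5:
        return total_disks - 1   # one disk's worth of parity
    if raid_level == 6:
        return total_disks - 2   # two disks' worth of parity
    if raid_level == 10:
        return total_disks // 2  # assumption: two-way mirrors
    raise ValueError(f"unsupported RAID level: {raid_level}")


def data_stripe_width_kib(stripe_size_kib: int, raid_level: int, total_disks: int) -> int:
    """data width = array stripe size * number of data disks in the array."""
    return stripe_size_kib * data_disks(raid_level, total_disks)


if __name__ == "__main__":
    # One 6-disk RAID 6 leg with a 128 KiB or an 8 KiB array stripe size.
    print(data_stripe_width_kib(128, 6, 6))
    print(data_stripe_width_kib(8, 6, 6))
```

For a 6-disk RAID 6 leg this gives 512KiB for a 128KiB array stripe size and 32KiB for an 8KiB one, which are the two data stripe widths that come up in the rest of this study.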
Stevecs explains this rather well:
pvcreate --metadatasize X /dev/sda /dev/sdb
# (X = 511, or whatever it takes to start at 512KiB with padding)
vgcreate ssd -s Y /dev/sda /dev/sdb
# Y here is your physical extent size, which you probably want to be a multiple of your data stripe width
# default is 4MiB and that’s fine, as our data stripe widths here divide evenly into it for 8K (32KiB) or 128K (512KiB)
lvcreate -i2 -IZ -L447G -n ssd-striped ssd
# Z here should be a multiple of your data stripe width (32KiB or 512KiB)
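The three conditions in these comments boil down to simple divisibility checks. Below is a minimal sketch of them; the function and key names are hypothetical, and the offsets are values you would have to read back from your own LVM setup rather than anything this snippet queries itself.

```python
KIB = 1024
MIB = 1024 * KIB


def is_aligned(offset_bytes: int, data_stripe_width_bytes: int) -> bool:
    """True when an offset falls exactly on a data stripe boundary."""
    return offset_bytes % data_stripe_width_bytes == 0


def check_layout(first_pe_offset: int, extent_size: int,
                 lv_stripe_size: int, data_stripe_width: int) -> dict:
    """Check the three sizes mentioned above against the data stripe width."""
    return {
        # where the first physical extent starts (controlled via metadatasize)
        "first_pe_aligned": is_aligned(first_pe_offset, data_stripe_width),
        # physical extent size (the -s option of vgcreate)
        "extent_size_ok": is_aligned(extent_size, data_stripe_width),
        # LVM stripe size (the -I option of lvcreate)
        "lv_stripe_ok": is_aligned(lv_stripe_size, data_stripe_width),
    }


if __name__ == "__main__":
    # The 192k header mentioned earlier does not land on a 128k boundary...
    print(is_aligned(192 * KIB, 128 * KIB))
    # ...while a start at 512KiB with 4MiB extents fits a 512KiB data stripe width.
    print(check_layout(512 * KIB, 4 * MIB, 512 * KIB, 512 * KIB))
```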
Applying this theory did help in some cases, but the 65k barrier hadn’t moved an inch.
Something we overlooked in the previous study is the fact that Linux has a configurable IO scheduler. Currently available are the following four:
- Completely Fair Queuing Scheduler (CFQ)
- Anticipatory IO Scheduler (AS)
- Deadline Scheduler
- No-op Scheduler
For an in-depth discussion of each scheduler, see: 3. Schedulers / Elevators
The CFQ and AS schedulers in particular are designed to reorder IO so that seek latencies are minimized. SSDs, of course, don’t have any notable seek delay, and the cost of reaching a block doesn’t depend on where the previously accessed block was. It thus doesn’t matter at all whether you fetch e.g. blocks 1, 10 and 20 sequentially or totally at random. Setting the scheduler to noop, which doesn’t try to be smart and basically does nothing at all, improved overall performance again. However, there was STILL a clear saturation point in our performance graphs. The barrier had moved up a little, but it was still very much there.
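For reference, the scheduler is selected per block device through sysfs, in the file /sys/block/<device>/queue/scheduler, where the active scheduler is shown in square brackets. The sketch below reads and switches it from Python; the helper names are our own, the device name "sda" is just an example, and writing the file requires root.

```python
from pathlib import Path


def parse_schedulers(line: str):
    """Split a line such as 'noop anticipatory deadline [cfq]'
    into (available schedulers, active scheduler)."""
    available = []
    active = None
    for token in line.split():
        if token.startswith("[") and token.endswith("]"):
            token = token[1:-1]  # the bracketed entry is the active one
            active = token
        available.append(token)
    return available, active


def scheduler_path(device: str) -> Path:
    return Path("/sys/block") / device / "queue" / "scheduler"


def get_scheduler(device: str):
    """Read the available and active schedulers for e.g. 'sda'."""
    return parse_schedulers(scheduler_path(device).read_text())


def set_scheduler(device: str, name: str) -> None:
    """Switch the scheduler, e.g. set_scheduler('sda', 'noop'). Needs root."""
    scheduler_path(device).write_text(name)


if __name__ == "__main__":
    print(parse_schedulers("noop anticipatory deadline [cfq]"))
```

A change made this way does not survive a reboot, so for repeated benchmark runs it is worth recording the active scheduler alongside each result.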
At this point we were frantically wading through the C source code of the Areca driver, monitoring the interrupts and basically grasping at every straw within our reach. It was then that my co-worker Dennis noticed the following when examining the jfs module:
author: Steve Best/Dave Kleikamp/Barry Arndt, IBM
description: The Journaled Filesystem (JFS)
vermagic: 2.6.26-1-amd64 SMP mod_unload modversions
parm: nTxBlock:Number of transaction blocks (max:65536) (int)
parm: nTxLock:Number of transaction locks (max:65536) (int)
parm: commit_threads:Number of commit threads (int)
After we also examined the source code of the jfs driver (hooray for open source, see jfs_txnmgr.c), it became clear that we had found our bottleneck: the 65536 maximum on transaction blocks and locks matches the roughly 65K ceiling we kept hitting. We thus tried another high-performance file system that we had wanted to test before, but had never found the time for. This turned out to be the breakthrough that we were looking for. Basically, by changing a J into an X in our setup, we more than doubled the number of IOPS and got far more than the 100K IOPS we were targeting.
Figure 4 shows the difference between using JFS with the CFQ scheduler and using XFS with the NOOP scheduler, both on the 2x6xRAID6 configuration (12 disks total). As before, the results were obtained with the easy-co benchmark tool. Unfortunately, we changed our RAID controllers from a dual 1680 to a dual 1231 in the meantime and didn’t redo the easy-co benchmark. The graph nevertheless clearly shows how the barrier has been broken:
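When hunting for limits like this one in other modules, it can help to pull the parameters and their documented maximums out of such a listing automatically. The sketch below assumes text in the "parm: name:description (type)" layout shown above (which looks like modinfo output, though that is our assumption); the function name and the dictionary keys are made up for illustration.

```python
import re

# "parm: <name>:<description> (<type>)", with an optional "(max:N)" in the description
PARM_RE = re.compile(r"^parm:\s*(\w+):(.*)\((\w+)\)\s*$")
MAX_RE = re.compile(r"\(max:(\d+)\)")


def parse_parms(text: str) -> dict:
    """Map each module parameter name to its description, type and maximum."""
    parms = {}
    for line in text.splitlines():
        match = PARM_RE.match(line.strip())
        if not match:
            continue  # not a parm line (author, description, vermagic, ...)
        name, description, type_name = match.groups()
        max_match = MAX_RE.search(description)
        parms[name] = {
            "description": MAX_RE.sub("", description).strip(),
            "type": type_name,
            "max": int(max_match.group(1)) if max_match else None,
        }
    return parms


if __name__ == "__main__":
    listing = """\
description: The Journaled Filesystem (JFS)
parm: nTxBlock:Number of transaction blocks (max:65536) (int)
parm: nTxLock:Number of transaction locks (max:65536) (int)
parm: commit_threads:Number of commit threads (int)
"""
    for name, info in parse_parms(listing).items():
        print(name, info["max"])
```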
fig 4. random IOps v.s. block-size
2x6xRAID6 (12 SSDs total)
There are still some questions left unanswered. With this setup adding more disks to the array (from 2×6 to 2×8) did not improve the IOPS performance at all. Neither did changing the RAID level from 6 to 10, which theoretically should have given us another performance boost.
In the previous study we only used easy-co’s bm-flash as our benchmark tool. For this study we verified our findings using another tool: xdd. Where bm-flash tests for a fixed amount of time (10 seconds per item), xdd tests for a given amount of data. This makes it easier to rule out caching effects by specifying an amount of data that surely doesn’t fit in the cache, thereby forcing the controller to actually go to disk. Since our setup uses 8GB of cache, we used 16GB of data for each test run, randomly read from a 64GB file using direct IO. Each test was repeated multiple times and the final result taken as the average over these runs (there was no notable difference between the runs though).
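The shape of such a data-bound test can be sketched in a few lines. This is not how xdd works internally; it only illustrates the plan described above: read a fixed amount of data, larger than the cache, at random block-aligned offsets within a bigger file, and derive IOPS from the number of reads and the elapsed time. All names, and the factor of two between cache size and test data, are our own choices for the example.

```python
import random
from typing import Iterator, Optional

GIB = 1024 ** 3


def min_test_bytes(cache_bytes: int, factor: int = 2) -> int:
    """Amount of data to read so that it clearly cannot fit in the cache."""
    return cache_bytes * factor


def random_read_offsets(file_size: int, block_size: int, total_bytes: int,
                        seed: Optional[int] = None) -> Iterator[int]:
    """Yield block-aligned random offsets until total_bytes have been covered."""
    rng = random.Random(seed)
    blocks_in_file = file_size // block_size
    reads = total_bytes // block_size
    for _ in range(reads):
        yield rng.randrange(blocks_in_file) * block_size


def iops(reads: int, seconds: float) -> float:
    """IO operations per second for a run of `reads` operations."""
    return reads / seconds


if __name__ == "__main__":
    # 8 GiB of cache, so read 16 GiB in total from a 64 GiB file.
    total = min_test_bytes(8 * GIB)
    first_five = []
    for offset in random_read_offsets(64 * GIB, 4096, total, seed=1):
        first_five.append(offset)
        if len(first_five) == 5:
            break
    print(total // GIB, first_five)
```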
Figure 5 shows the results for a 2x6xRAID6 setup, using 128 threads, LVM, array stripe size set to 128KB and the software stripe size set to 512KB. The following exceptions hold:
- ssz = array stripe size 8KB, software stripe size 32KB
- mdadm = mdadm used instead of LVM for software striping
- raid10 = RAID10 used instead of RAID6
fig 5. random IOps v.s. block-size.
2x6xRAID6 (12 SSDs total)
Note that this graph only shows part of the data that was shown in fig 4. Since XDD tests take a great deal longer than bm-flash tests and the interesting data is within the 1KB to 32KB range anyway, we decided to limit the XDD tests to that range.
When looking at figure 5 it becomes clear we’re seeing the exact same barrier as when using bm-flash. In this figure too, the graph shows that XFS breaks through this barrier. However, the actual number of reported IOPS is a good deal lower than what bm-flash reported. Here too, going from 2x6xRAID6 to 2x8xRAID10 did not improve performance, although there is a slight performance difference for the larger block sizes, 16KB and 32KB.
Performance tuning remains a difficult thing. Having powerful hardware is one thing, but extracting the raw power from it can be challenging to say the least. In this study we have looked at alignment, IO schedulers and file systems. Of course it doesn’t stop there. There are many more file systems, and most file systems have their own additional tuning parameters. XFS is especially rich in that area, but we have yet to look into that.
Benchmarking correctly is a whole other story. It’s important to change only a single parameter between test runs meant for comparison, and to document all involved settings for each and every test run. Next to that, it’s important to realize what it is that you’re testing. Since bm-flash only runs each sub-test for 10 seconds, it’s likely that the cache is tested in addition to the actual disks. Also, in this study we only looked at random read performance, but write performance is important too. Taking write performance into account (pure writes, or doing a 90% read/10% write test) can paint a rather different picture of the performance characteristics.
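The "one parameter at a time" rule is easy to check mechanically once the settings of every run are written down. The sketch below compares two such records; the function names, the keys and the way the values are spelled are only an illustration, not a format we actually used.

```python
def changed_parameters(run_a: dict, run_b: dict) -> dict:
    """Return {parameter: (value in run_a, value in run_b)} for every difference."""
    keys = run_a.keys() | run_b.keys()
    return {
        key: (run_a.get(key), run_b.get(key))
        for key in sorted(keys)
        if run_a.get(key) != run_b.get(key)
    }


def is_single_change(run_a: dict, run_b: dict) -> bool:
    """True when exactly one documented setting differs between the two runs."""
    return len(changed_parameters(run_a, run_b)) == 1


if __name__ == "__main__":
    # The two runs compared in figure 4, written down as settings records.
    jfs_run = {"filesystem": "jfs", "scheduler": "cfq",
               "layout": "2x6xRAID6", "controller": "dual 1680"}
    xfs_run = {"filesystem": "xfs", "scheduler": "noop",
               "layout": "2x6xRAID6", "controller": "dual 1231"}
    print(changed_parameters(jfs_run, xfs_run))
    print(is_single_change(jfs_run, xfs_run))
```

Run against the two configurations behind figure 4, this reports three differences (file system, scheduler and controller), which is exactly why that comparison shows that the barrier was broken but not how much each change contributed.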
Many thanks have to go to stevecs, who kindly provided lots of insight that helped us to tune our system. See this topic for additional details on that.
This benchmark was brought to you by Dennis Brouwer and Arjan Tijms of JDevelopment, an M4N team.