Archive for the ‘SSD’ Category

100K+ IOPS on semi-commodity hardware

1 June 2009

Abstract

A while ago we conducted a study to find the fastest IO subsystem that money can buy these days, with the only restriction that it should be built from (semi) commodity hardware. That meant extremely expensive solutions, where a sales representative needs to visit your office just to get a price indication, were off limits. We reported our findings here: One DVD per second

The 65K IOPS barrier

When we examined the graphs resulting from our previous study, it was pretty obvious that we were hitting a bottleneck somewhere. No matter the amount of hardware that we threw at the system, all graphs quickly saturated at around a 65K IOPS limit. This is most apparent in the following graphs, which we repeat here from the previous study:

fig 1. random IOps v.s. block-size and threads
RAID0 8 SSDs

iops-random-raid0-8.png

fig 2. random IOps v.s. block-size and threads
RAID0 12 SSDs

iops-random-raid0-12.png

fig 3. random IOps v.s. block-size and threads
2x8xRAID0 (16 SSDs total)

iops-random-raid00-16.png

In these examples, maxing out the number of SSDs per controller (figure 2) or doubling the entire storage system and striping the two halves together (figure 3) does not raise the maximum number of random IOPS. The setup depicted in figure 3 does bring some benefit, but it merely pushes two more data points (the 64k and 128k block sizes) towards the imaginary 60~65K barrier.

Alignment, stripe size and data stripe width

For SSD performance, correct alignment is of the utmost importance: a misaligned SSD will seriously underperform. In our previous study we did our best to get the alignment right. Since we were using LVM to do the striping, we set the metadata size to 250KB. This is a little trick we found, and apparently others discovered it too:

LVM likes to allocate 192k for its header information, and 192k is not a multiple of 128k. So if you are creating file systems as logical volumes, and you want those volumes to be properly aligned you have to tell LVM that it should reserve slightly more space for its meta-data, so that the physical extents that it allocates for its logical volumes are properly aligned. Unfortunately, the way this is done is slightly baroque:

# pvcreate --metadatasize 250k /dev/sdb2
Physical volume "/dev/sdb2" successfully created

Why 250k and not 256k? I can’t tell you — sometimes the LVM tools aren’t terribly intuitive. However, you can test to make sure that physical extents start at the proper offset by using:

See: Aligning filesystems to an SSD's erase block size
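As an illustration (our own example, not part of the quoted article; the device name is hypothetical), the offset of the first physical extent can be checked with pvs:

# pe_start should be a multiple of your data stripe width (e.g. 256k or 512k)
pvs /dev/sdb2 -o +pe_start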

When digging some more into this, we learned about the concept “data stripe width”, which is defined as:

data stripe width = array stripe size * number of data disks in your array

E.g. for RAID6 it's array stripe size * (number of disks in the array - 2) and for RAID5 it's array stripe size * (number of disks in the array - 1). For one of our 6-disk RAID6 legs with a 128KiB array stripe size this works out to (6 - 2) * 128KiB = 512KiB. This is an important number when you are striping multiple smaller arrays (sub-arrays, or what we called ‘legs’) into a larger array: it is the amount of data that together occupies exactly one array stripe on each device making up the sub-array, and it's this number that you should work with when trying to align LVM.

Stevecs explains this rather well:

pvcreate --metadatasize X /dev/sda /dev/sdb
# X = 511, or whatever it takes to make the data start at 512KiB (with padding)

vgcreate ssd -s Y /dev/sda /dev/sdb
# Y here is your physical extent size, which you probably want to be a multiple of your data stripe width
# the default is 4MiB and that's fine, since our data stripe widths here divide evenly into it: 32KiB (for 8K stripes) or 512KiB (for 128K stripes)

lvcreate -i2 -IZ -L447G -n ssd-striped ssd
# Z here should be a multiple of your data stripe width (32KiB or 512KiB)

See: xtremesystems
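To make the X, Y and Z above concrete, here is a sketch of how the numbers work out for one of our configurations (a 6-disk RAID6 leg per controller with a 128KiB array stripe size, i.e. a data stripe width of (6 - 2) * 128KiB = 512KiB); the device names and volume size are illustrative only:

pvcreate --metadatasize 511k /dev/sda /dev/sdb
# pad the metadata so that the data area starts at 512KiB on each leg

vgcreate ssd -s 4M /dev/sda /dev/sdb
# 4MiB physical extents; 512KiB divides 4MiB evenly

lvcreate -i2 -I512 -L447G -n ssd-striped ssd
# stripe over both legs with a 512KiB stripe size (-I takes KiB)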

Applying this theory did help in some cases, but the 65k barrier hadn’t moved an inch.

The Linux IO scheduler

Something we overlooked in the previous study is the fact that Linux has a configurable IO scheduler. Currently available are the following four:

  1. Complete Fair Queueing Scheduler (CFQ)
  2. Anticipatory IO Scheduler (AS)
  3. Deadline Scheduler
  4. No-op Scheduler

For an in-depth discussion of each scheduler, see: 3. Schedulers / Elevators

The CFQ and AS schedulers in particular are designed to reorder IO in such a way that seek latencies are avoided. The thing is, of course, that SSDs don't have any notable seek delay, regardless of where blocks are located relative to each other. It thus doesn't matter whether you fetch e.g. blocks 1, 10 and 20 sequentially or completely at random. Setting the scheduler to noop, which doesn't try to be smart and basically does nothing at all, improved overall performance again. However, there was STILL a clear saturation point in our performance graphs. The barrier had moved up a little, but it was still very much there.
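For completeness, a sketch of how the scheduler can be inspected and switched per block device at runtime (the device name is just an example; the choice can also be made permanent with the elevator= kernel boot parameter):

# show the available schedulers; the active one is listed between brackets
cat /sys/block/sda/queue/scheduler
# switch this device to the noop scheduler
echo noop > /sys/block/sda/queue/scheduler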

The file-system

At this point we were frantically wading through the C source code of the Areca driver, monitoring the interrupts and basically grasping at every straw within our reach. It was then that my co-worker Dennis noticed the following when examining the jfs module:

modinfo jfs
filename: /lib/modules/2.6.26-1-amd64/kernel/fs/jfs/jfs.ko
license: GPL
author: Steve Best/Dave Kleikamp/Barry Arndt, IBM
description: The Journaled Filesystem (JFS)
depends: nls_base
vermagic: 2.6.26-1-amd64 SMP mod_unload modversions
parm: nTxBlock:Number of transaction blocks (max:65536) (int)
parm: nTxLock:Number of transaction locks (max:65536) (int)
parm: commit_threads:Number of commit threads (int)

After we also examined the source code of the jfs driver (hooray for open source, see jfs_txnmgr.c), it became clear that we had found our bottleneck: the JFS transaction manager caps its transaction blocks and locks at 65536, right where our graphs flatten out. We therefore tried another high-performance file system that we had wanted to test before but never found the time for. This turned out to be the breakthrough we were looking for. Basically, by changing a J into an X in our setup, we more than doubled the number of IOPS and got far more than the 100K IOPS we were targeting.
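We won't list our exact commands here, but a minimal sketch of creating and mounting an aligned XFS file system on the striped LVM volume could look as follows (the su/sw values assume the 512KiB stripe over two legs described earlier; device and mount point are illustrative):

# su = stripe unit (512KiB), sw = number of stripe units per full stripe (2 legs)
mkfs.xfs -d su=512k,sw=2 /dev/ssd/ssd-striped
# mount with the same write-reducing options we used for JFS
mount -o noatime,nodiratime /dev/ssd/ssd-striped /ssd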

Figure 4 shows the difference between using JFS with the CFQ scheduler and using XFS with the NOOP scheduler, both on the 2x6xRAID6 configuration (12 disks total). Results, as before, were obtained with the easy-co benchmark tool. Unfortunately, we changed our RAID controllers from a dual 1680 to a dual 1231 in the meantime and we didn't redo the easy-co benchmark. It nevertheless clearly shows how the barrier has been broken:

fig 4. random IOps v.s. block-size
2x6xRAID0 (12 SSDs total)

iops-random-raid0-8.png

There are still some questions left unanswered. With this setup adding more disks to the array (from 2×6 to 2×8) did not improve the IOPS performance at all. Neither did changing the RAID level from 6 to 10, which theoretically should have given us another performance boost.

Alternative benchmark tools

In the previous study we only used easy-co's bm-flash as our benchmark tool. For this study we verified our findings using another tool: xdd. Where bm-flash tests for a fixed amount of time (10 seconds per item), xdd tests for a given amount of data. This makes it easier to rule out caching effects by specifying an amount of data that surely doesn't fit in the cache, thereby forcing the controller to actually go to disk. Since our setup uses 8GB of cache, we used 16GB of data for each test run, randomly read from a 64GB file using direct IO. Each test was repeated multiple times and the final result taken as an average over these runs (there was no notable difference between the runs though).
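For reference, an xdd invocation roughly along the lines of what we describe could look like this; the file name is illustrative and the exact flag names are from memory of xdd and should be checked against your version:

# 8KiB random reads with direct IO and a queue depth of 128,
# reading 16GiB in total (2097152 requests of 16 x 512B = 8KiB) from a 64GiB file
xdd -op read -targets 1 /ssd/test64g.bin -blocksize 512 -reqsize 16 \
    -numreqs 2097152 -seek random -dio -queuedepth 128 -passes 3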

Figure 5 shows the results for a 2x6xRAID6 setup, using 128 threads, LVM, array stripe size set to 128KB and the software stripe size set to 512KB. The following exceptions hold:

  • ssz = array stripe size 8KB, software stripe size 32KB
  • mdadm = mdadm used instead of lvm for software striping
  • raid10 = RAID10 used instead of RAID6

fig 5. random IOps v.s. block-size.
2x6xRAID6 (12 SSDs total)

xdd_2x6xraid6_jfs_vs_xfs.png

Note that this graph only shows part of the data that was shown in fig 4. Since XDD tests take a great deal longer than bm-flash tests and the interesting data is within the 1KB to 32KB range anyway, we decided to limit the XDD tests to that range.

When looking at figure 5 it becomes clear we're seeing the exact same barrier as when using bm-flash. In this figure too, the graph shows that XFS breaks through this barrier. However, the actual number of reported IOPS is a good deal lower than what bm-flash reported. Here too, going from 2x6xRAID6 to 2x8xRAID10 did not improve performance, although there is a slight performance difference for the larger block sizes 16KB and 32KB.

Conclusion

Performance tuning remains a difficult thing. Having powerful hardware is one thing, but extracting the raw power from it can be challenging to say the least. In this study we have looked at alignment, IO schedulers and file systems. Of course it doesn't stop there. There are many more file systems, and most file systems have their own additional tuning parameters. XFS in particular is rich in that area, but we have yet to look into that.

Benchmarking correctly is a whole other story. It's important to change only a single parameter between test runs meant for comparison, and to document all involved settings for each and every test run. Next to that, it's important to realize what it is that you're testing. Since bm-flash only runs each sub-test for 10 seconds, it's likely that the cache is tested in addition to the actual disks. Also, in this study we only looked at random read performance, but write performance is important too. Taking write performance into account (pure writes, or doing a 90% read/10% write test) can paint a rather different picture of the performance characteristics.

Acknowledgement

Many thanks have to go to stevecs, who kindly provided lots of insight that helped us to tune our system. See this topic for additional details on that.


This benchmark was brought to you by Dennis Brouwer and Arjan Tijms of JDevelopment, an M4N team.

SSD performance improvements set LIVE

22 May 2009

In a previous blog entry (One DVD per second) we described how we built a fast new SSD-based database server and how we benchmarked it.

This week we put the new SSD DB server to the ultimate test, namely “Going Live”.

It turned out that our investments definitely paid off. Our main online application M4N has seen a massive speed improvement. We found an average performance increase of ~6 times for all queries. As a result, the initial average performance increase for the whole application is approximately a factor 2. Keep in mind that the whole application depends on more than just the main DB. Of course, the performance of the Java Application Server and the network bandwidth play important roles as well.

The performance increase is most apparent when simply browsing through the different pages of the application. The application feels very snappy and even the more complex pages load nearly instantly. We keep statistics on our average page speeds, so I have included these in the pictures below.

Picture one: the speed improvements measured during testing:

Speed improvements new database server SSD Postgresql

Picture two: the speed improvements in query execution:

Statistics speed improvement Postgresql database

Finally, during the night we execute a slew of maintenance queries. Normally, we see a number like this after the script that triggers these queries has finished executing:


real 348m6.451s
user 0m4.912s
sys 0m0.808s

The other morning however we saw this:


real 30m34.747s
user 0m2.752s
sys 0m1.008s

This amounts to a performance increase of over 11 times.

For now we thus cautiously conclude that using SSDs indeed improves performance by a large margin.

One DVD per second

20 February 2009

Abstract

For modern database systems the main bottleneck is usually IO. In order to speed up our database we conducted a study to find the fastest IO system currently available. In this article we describe our approach and findings. The result: with our new database server filled with SSDs we can transfer data at a rate equivalent to nearly one DVD per second: 3.3GB/sec.

Introduction

Our current SAS hard-disk based database server (6 disks, RAID10) had an access-time bottleneck: too much concurrent database access caused the server to slow down considerably. In a previous article, building-the-new-battleship-mtron, we promised test results using our new hardware.

We can now finally show you some of the obtained test results and give you a taste of what SSD-based file systems will bring. Of course, we ran into several firmware incompatibility issues using this emerging hardware, found motherboard PCI-E bottlenecks, discovered Linux kernel performance differences, software-specific benchmarking problems and strange file system hiccups, but most of this has been circumvented to produce the figures listed in this article.

Benchmarking was not a goal in itself; the real question was: given this amount of hardware (which reflects a certain amount of hard cash), what would be the best performing, cost-effective and fail-safe setup?

Well, we think we figured that out. Where to start… first let us explain why we chose this hardware setup in the first place. And for those who want to jump to the conclusions, be our guest.

Hardware, the outfit

SSD: they come in several flavors, where the best tasting is the SSD built with SLC memory; these last longer and generally perform better on writes. The downside: they are much more expensive. We already had some hands-on experience with MTRON SSDs in the past, so we chose the newer, better and faster MTRON 7500 Pro series. Of course there are Intel, MemoRight, OCZ and Transcend too, but at the time of writing MTRON was the best performer on paper. We ordered 12 MTRON MSP7535-032 (32GB) and 4 extra MTRON MSP7535-064 (64GB) SSDs.

RAID controller: that should be a PCI-E x8 board to handle all the IO. Unfortunately, at the time of writing no PCI-E x16 boards were available (and not many, if any, server motherboards support PCI-E x16 slots). As much cache as possible should be on board (only useful when a Battery Back-Up Unit is applied). And, this may sound odd, but specifically for us the controller also needed to support SAS disks, because we wanted to re-deploy some of our SAS hard-disks for old time's sake. A logical choice was to go for the ARC1680IX-12. Not only had we already gained some experience on a test server with its smaller brother, the ARC1680IX-8, but the recent firmware release 1.45 put it nearly on par with the native SATA controllers of the same brand. Problematic for us is the fact that all current high-performance RAID controllers have been designed with traditional hard disks in mind, which means that the RAID controller itself often becomes the bottleneck. To overcome this bottleneck we used not one, but two ARC1680IX-12s in parallel. The setup we used is depicted in figure A.

fig A. RAID05 dual controller setup
dual_raid_setup.png

Server: the server motherboard selected is the Super Micro X7DWN+, which can hold up to 128GB of RAM and two Quad Core Xeon CPUs. We equipped the board with two X5460 CPUs (at that time the highest-clocked Xeon, which comes in handy for CPU-bound calculations) and a mere 48GB of RAM. The server will act as a database server; the more RAM the better, since most of the data can then be held in memory.

Casing: a couple of OS disks, legacy SAS hard-disks and 12 SSDs need a shelter; we chose the Super Micro SC846, a 4U server case that nicely matches our two 1680IX-12 controllers and comes with a redundant power supply, something that is truly enterprise-worthy.

Firmware, the hurdles

The first thing we did was install the controller cards, wire the backplane, insert an ordinary hard disk for the OS, install Debian on it and insert all SSDs in random order into the cabinet. Then the reboot, and… show time!

No, not yet: the event log of the Areca controllers showed many Time-Out errors. Doing the regular stuff like building RAID arrays seemed impossible, let alone creating a file system on the newly created block-devices.

Research revealed that the hardware came with different firmware than we had used before. We found out that the 64GB SSDs were shipped with firmware 0.18R1H3, the same we used in our test server, while the 32GB SSDs came with 0.19R1, which was new to us. The drives with firmware 0.18R1H3 did indeed work properly. So we first asked our supplier to ship us the newer firmware 0.19R1H2 available at that time (sometimes newer is better). We flashed the disks to 0.19R1H2, but still no good; a lot of Time-Out errors appeared in the event log again. Then we figured that sometimes older is better and asked MTRON if it was possible to downgrade the firmware, and luckily it was. We were provided with the DOWN.EXE executable and the proper MTRON.MFI, and we flashed the SSDs back to 0.18R1H3. Bingo… all seemed to work fine now. We recently tried to upgrade some SSDs to firmware 0.20R1, but that didn't work either.

Not only does MTRON release firmware on a regular basis, but so does Areca. We therefore upgraded the ARC1680IX-12s to the 1.46 firmware released on 23-01-2009 and performed the whole ritual again, flashing the SSDs to the latest version 0.20R1. At first glance no Time-Out errors were noticed after the flash; however, building RAID arrays slowed down by nearly a factor of ten and a file system created on the resulting block-device performed over 20 times worse than expected. We are back to 0.18R1H3 again. Currently Areca and Mtron are investigating in a joint effort what causes these problems. Some hurdles have been taken, but we are still running tests and aren't finished yet; we will keep you posted. The latest news is that Areca was able to reproduce the problems, but traced them to Intel's IOP348 firmware. Of course Areca is unable to fix anything there, so it's now Intel's turn.

Motherboard, keeping the right PCI-E x8 lane

The motherboard comes with plenty of PCI-E x8 slots and we installed the RAID controllers in the slots that would give the best airflow in the casing. Since we have two controllers, we created identical arrays on both controllers (let's call them “legs”). Tests showed a read performance difference of some 25% between those legs. Swapping the controller cards, and even the SSDs, showed that the cards were performing okay and the SSDs were not to blame either. When looking in the manual we discovered that the PCI-E x8 slots we had tried were driven by different chipsets. Our board comes with an Intel 631xESB/632xESB IO controller chip (south-bridge) and the 5400MCH (north-bridge), and the 631xESB/632xESB IO controller chip performs worse. If you are putting any piece of hardware in your server, be aware of, or better, test which slot to use.

Put in figures: using the 5400MCH we were able to do multi-threaded sequential and random READs with a maximum total bandwidth of 1750MB/s, while the 631xESB/632xESB IO controller stopped at 1400MB/s. WRITING was not affected by the choice of slot, since it is bottlenecked by the disks themselves.
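A quick way to verify which bridge and link width a controller actually ended up on is lspci (the bus address below is an example, not the one from our server):

lspci | grep -i areca
# LnkCap shows the maximum link the card supports, LnkSta the link that was actually negotiated (e.g. x8)
lspci -vv -s 04:00.0 | grep -iE 'lnkcap|lnksta'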

Benchmarking, the software as referee

This may be considered one of the most important aspects of deriving figures from a server: which benchmark software to use? Ideally it would be software for which a lot of results are already available for comparison. We evaluated several: bonnie++, IOMeter, IOZone, bm-flash, Orion and some more, all with their own pros and cons.

We eventually settled on bm-flash and IOMeter. bm-flash takes a fixed amount of time to finish and gives a quick impression of file system performance, while IOMeter with a default workload is widely used, so there is an abundance of figures for comparison. We dropped bonnie++ because under some conditions it will only show “++++” when no useful figures could be calculated. IOZone and Orion required a complete study to interpret.

In order to use IOMeter we had to drop Linux and install Windows Server 2008. Dynamo (the test client) running on Linux and the IOMeter GUI (manager) running on Windows are somehow incompatible and continuously caused crashes on our test server. For our excursion to Windows Server 2008 we were stuck with NTFS.

For benchmarking on Linux we chose the JFS file system as our default, because previous tests showed (in our case) that it used the least CPU overhead and performed better than ext3. One can get totally lost in file system tuning, but we decided it was not feasible to tune too many parameters at a time. Note that the mount options were set to keep writes to the disks to a minimum:

rw,noatime,nodiratime
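For completeness, a sketch of how such a volume might be mounted with those options (device and mount point are illustrative):

# mount the JFS volume without access-time updates to minimize writes
mount -t jfs -o rw,noatime,nodiratime /dev/ssd/ssd-striped /ssd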

Linux software striping, the odd and even lanes

In order to increase the bandwidth we wanted to make use of software striping, which as a side effect also doubles the amount of RAM cache. We created two identical legs on the controllers, plugged them into equally performing lanes and installed the mdadm and lvm2 packages for Debian (Debian Lenny – Linux 2.6.26-1-amd64). For both mdadm and lvm2 we configured striped block-devices. Let the games begin…

Without bothering you with too many figures right now, we discovered an interesting finding: block-devices created by mdadm performed about 10% better than lvm2. Write bandwidth is quite similar, but the number of IOPS and the read bandwidth using mdadm are higher. A survey of the internet showed that more people have complained about this performance difference and that file system alignment and read-ahead settings should be properly set, but even when everything is properly configured the difference remains. E.g. we can do at most 65000 IOPS using mdadm and only 59000 IOPS with lvm2, and we can read (sequential, single threaded) at most 1050MB/s using mdadm and only 980MB/s with lvm2. These measurements were obtained with reasonably fresh/new SSDs; later on performance degraded and there was seemingly no way to get the original performance back. The mdadm/lvm2 performance difference can be attributed to the Linux kernel: mdadm uses the md kernel module while lvm2 uses the dm kernel module, and these are beyond our control. For reasons of system administration support we will not use mdadm but will go with lvm2.

Note: we recently retested on Debian Lenny, now stable. The max read bandwidth (sequential, single threaded) for both mdadm and lvm2 stopped at 990MB/s. The difference in IOPS remained (65000 mdadm vs. 60000 lvm2), but was only noticeable for small block-sizes (512B – 1kB) and is therefore of minor importance to us.
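For reference, a sketch of how the two striped block-devices might be created on top of the two controller legs (device names and volume size are illustrative; the 128kB stripe/chunk size matches the test specification below):

# mdadm: RAID0 stripe over the two legs with a 128kB chunk size
mdadm --create /dev/md0 --level=0 --raid-devices=2 --chunk=128 /dev/sda /dev/sdb

# lvm2: equivalent striped logical volume (-I takes the stripe size in KiB)
pvcreate /dev/sda /dev/sdb
vgcreate ssd /dev/sda /dev/sdb
lvcreate -i2 -I128 -L400G -n ssd-striped ssd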

Hiccups, the file system’s full stop

During our test runs we sometimes noticed that the performance of the system was below expectations. A typical test run using bm-flash would look like this [RAID0, 12 SSDs attached to one controller]:

test:/benchmark# ./bm-flash /ssd/test.txt 

Filling 4G before testing  ...   4096 MB done in 3 seconds (1365 MB/sec).

Read Tests:

Block |   1 thread    |  10 threads   |  40 threads   
 Size |  IOPS    BW   |  IOPS    BW   |  IOPS    BW   
      |               |               |               
 512B | 11307    5.5M | 63751   31.1M | 63312   30.9M 
   1K | 22720   22.1M | 63717   62.2M | 63086   61.6M 
   2K | 24788   48.4M | 63510  124.0M | 62734  122.5M 
   4K | 29158  113.8M | 63088  246.4M | 62361  243.5M 
   8K | 30170  235.7M | 61824  483.0M | 61089  477.2M 
  16K | 26110  407.9M | 61337  958.3M | 60991  952.9M 
  32K | 19936  623.0M | 51250 1601.5M | 50853 1589.1M 
  64K | 13589  849.3M | 27716 1732.2M | 27717 1732.3M 
 128K |  8273 1034.2M | 13877 1734.7M | 13881 1735.1M 
 256K |  4828 1207.2M |  6942 1735.5M |  6947 1736.8M 
 512K |  2563 1281.5M |  3472 1736.1M |  3476 1738.4M 
   1M |  1447 1447.0M |  1736 1736.6M |  1740 1740.2M 
   2M |   796 1592.3M |   866 1732.5M |   868 1737.1M 
   4M |   407 1630.0M |   430 1720.7M |   437 1748.3M 

Write Tests:

Block |   1 thread    |  10 threads   |  40 threads   
 Size |  IOPS    BW   |  IOPS    BW   |  IOPS    BW   
      |               |               |               
 512B | 28529   13.9M | 29048   14.1M | 28567   13.9M 
   1K | 27759   27.1M | 28213   27.5M | 28053   27.3M 
   2K | 27393   53.5M | 28036   54.7M | 27513   53.7M 
   4K | 26289  102.6M | 25278   98.7M | 26446  103.3M 
   8K | 23068  180.2M | 22138  172.9M | 21824  170.5M 
  16K | 17426  272.2M | 16820  262.8M | 17404  271.9M 
  32K | 10607  331.4M | 10988  343.3M | 11173  349.1M 
  64K |  6208  388.0M |  6894  430.9M |  6941  433.8M 
 128K |  3611  451.4M |  3895  486.9M |  4019  502.4M 
 256K |  1999  499.9M |  2116  529.0M |  2060  515.2M 
 512K |   834  417.4M |  1074  537.3M |  1024  512.2M 
   1M |   412  412.6M |   596  596.6M |   561  561.6M 
   2M |   280  561.3M |   240  480.0M |   207  414.7M 
   4M |   153  615.1M |   137  551.5M |   119  478.7M

However, with certain other RAID configurations we noticed “gaps” in performance; the output would typically look like this [RAID5, 4 SSDs attached to one controller]:

test:/benchmark# ./bm-flash /ssd/test.txt 

Filling 4G before testing  ...   4096 MB done in 3 seconds (1365 MB/sec).

Read Tests:

Block |   1 thread    |  10 threads   |  40 threads
 Size |  IOPS    BW   |  IOPS    BW   |  IOPS    BW
      |               |               |
 512B |    24   12.2K | 28061   13.7M | 63593   31.0M
   1K | 22599   22.0M | 63882   62.3M | 63582   62.0M
   2K | 24644   48.1M | 63869  124.7M | 63275  123.5M
   4K | 29294  114.4M | 63519  248.1M | 62923  245.7M
   8K | 30188  235.8M | 62176  485.7M | 61474  480.2M
  16K | 26207  409.4M | 61542  961.5M | 61231  956.7M
  32K | 19944  623.2M | 51450 1607.8M | 50947 1592.1M
  64K | 13661  853.8M | 27717 1732.3M | 27718 1732.4M
 128K |  8356 1044.5M | 13878 1734.7M | 13881 1735.2M
 256K |  4756 1189.0M |  6942 1735.6M |  6947 1736.9M
 512K |  2563 1281.8M |  3473 1736.5M |  3476 1738.4M
   1M |  1383 1383.0M |  1737 1737.0M |  1739 1739.7M
   2M |   795 1590.1M |   852 1704.1M |   866 1733.0M
   4M |   408 1632.0M |   430 1720.3M |   436 1744.0M

Write Tests:

Block |   1 thread    |  10 threads   |  40 threads
 Size |  IOPS    BW   |  IOPS    BW   |  IOPS    BW
      |               |               |
 512B |  5642    2.7M |  7221    3.5M |  9223    4.5M
   1K |  7851    7.6M |  5218    5.0M |  5815    5.6M
   2K |  4142    8.0M |  4357    8.5M |  3976    7.7M
   4K |  3314   12.9M |  3379   13.2M |  2890   11.2M
   8K |  2251   17.5M |  2018   15.7M |  2500   19.5M
  16K |  1980   30.9M |  2659   41.5M |  3499   54.6M
  32K |  1303   40.7M |  1217   38.0M |  1342   41.9M
  64K |   240   15.0M |   144    9.0M |   126    7.9M
 128K |   499   62.3M |    96   12.0M |   110   13.8M
 256K |   309   77.2M |    13    3.2M
 512K |     6    3.4M
   1M |   154  154.0M |    10   10.5M
   2M |    80  160.3M |    99  198.7M |    99  199.1M
   4M |    42  170.7M |    54  218.7M |    57  230.0M

Notice the drop in performance that starts when writing 64kB blocks.

Tests with IOMeter showed that under certain conditions the whole controller became unresponsive for several seconds; we even measured file system freezes in the order of 10 seconds (both reading and writing). We are still puzzling over what causes these freezes, but we assume that after heavy writing, with the controller's cache completely filled up, the controller first flushes some cached data to disk and only then resumes working properly. We have asked Areca what the intended behavior should be and will keep you posted.

Figures, the benchmark results

Those interested in hard figures are probably looking at the graphs already. We deliberately will not show all the test results; we graphed them where possible to give an overview of SSD performance in different RAID configurations.

In short: we have 16 SSDs and two controllers that support a maximum of 12 SSDs each. We tested RAID0 and RAID5 but skipped RAID6 and RAID10 arrays, since write performance on those was rather poor. We repeatedly tested 1 to 12 SSDs in a RAID0 configuration and 3 to 12 SSDs in a RAID5 configuration. For lvm2 striped block-devices (using the two controllers with software RAID0) we repeatedly tested 2×4 to 2×8 SSDs per RAID0 and RAID5 configuration.

The 8kB block-size figures are emphasized because our database is compiled to work with an 8kB block-size.

Sequential read, write and file copy versus number of SSDs and RAID configuration

dd if=/dev/zero of=/ssd/file.txt bs=8K count=5M [write 40GB file]
dd if=/ssd/file.txt of=/dev/zero [read 40GB file]
cp /ssd/file.txt /ssd/copy-of-file.txt [copy 40GB file]

fig 1. sequential read, write bandwidth [MB/s] v.s. SSDs and RAID configuration
8kB block-size

sequential-8k-1-thread.png

fig 2. 40GB file copy [seconds] v.s. SSDs and RAID configuration (lower is better)
40gb-file-copy.png
Conclusion: starting from 6 SSDs the controller's read bandwidth saturates, while writing may still benefit from adding additional SSDs. Using lvm2 software RAID0 over two controllers with the same total number of SSDs (8, 10, 12) always outperforms a single controller.

Random read, write bandwidth versus number of SSDs and RAID configuration

bm-flash /ssd/file.txt [random read, write and IOPS v.s block-size and threads]

fig 3. random read, write bandwidth in [MB/s] v.s. SSDs and RAID configuration
8kB block-size, 1 thread

8k-random-1-thread.png

fig 4. random read, write bandwidth in [MB/s] v.s. SSDs and RAID configuration
8kB block-size, 10 threads

8k-random-10-threads.png

fig 5. random read, write bandwidth in [MB/s] v.s. SSDs and RAID configuration
8kB block-size, 40 threads

8k-random-40-threads.png

Conclusion: bandwidth seems to be limited by the number of IO requests that can be spawned by a single thread (process). Scaling up the number of threads from 10 to 40 hardly influences the total bandwidth. Using lvm2 software RAID0 has a slight performance penalty on random read.

Random read, write bandwidth and IOPS versus block size and RAID configuration

bm-flash /ssd/file.txt [random read, write and IOPS v.s block-size and threads]

Note: since we concluded that 10 or 40 threads are not making much of a difference we only show the 1 and 10 thread random read, write bandwidth and IOPS graphs for readability.

RAID 0

fig 6. random read, write bandwidth in [MB/s] v.s. block-size and threads
RAID0 4 SSDs

bw-random-raid0-4.png

fig 7. random IOps v.s. block-size and threads
RAID0 4 SSDs

iops-random-raid0-4.png

fig 8. random read, write bandwidth in [MB/s] v.s. block-size and threads
RAID0 8 SSDs

bw-random-raid0-8.png

fig 9. random IOps v.s. block-size and threads
RAID0 8 SSDs

iops-random-raid0-8.png

fig 10. random read, write bandwidth in [MB/s] v.s. block-size and threads
RAID0 12 SSDs

bw-random-raid0-12.png

fig 11. random IOps v.s. block-size and threads
RAID0 12 SSDs

iops-random-raid0-12.png

RAID 5

fig 12. random read, write bandwidth in [MB/s] v.s. block-size and threads
RAID5 4 SSDs

bw-random-raid5-4.png

fig 13. random IOps v.s. block-size and threads
RAID5 4 SSDs

iops-random-raid5-4.png

fig 14. random read, write bandwidth in [MB/s] v.s. block-size and threads
RAID5 8 SSDs

bw-random-raid5-8.png

fig 15. random IOps v.s. block-size and threads
RAID5 8 SSDs

iops-random-raid5-8.png

fig 16. random read, write bandwidth in [MB/s] v.s. block-size and threads
RAID5 12 SSDs

bw-random-raid5-12.png

fig 17. random IOps v.s. block-size and threads
RAID5 12 SSDs

iops-random-raid5-12.png

RAID00 lvm2

fig 18. random read, write bandwidth in [MB/s] v.s. block-size and threads
2x4xRAID0 (8 SSDs total)

bw-random-raid00-8.png

fig 19. random IOps v.s. block-size and threads
2x4xRAID0 (8 SSDs total)

iops-random-raid00-8.png

fig 20. random read, write bandwidth in [MB/s] v.s. block-size and threads
2x6xRAID0 (12 SSDs total)

bw-random-raid00-12.png

fig 21. random IOps v.s. block-size and threads
2x6xRAID0 (12 SSDs total)

iops-random-raid00-12.png

fig 22. random read, write bandwidth in [MB/s] v.s. block-size and threads
2x8xRAID0 (16 SSDs total)

bw-random-raid00-16.png

fig 23. random IOps v.s. block-size and threads
2x8xRAID0 (16 SSDs total)

iops-random-raid00-16.png

As we can see in figures 18, 20 and 22, starting from a 256kB block-size we finally reach our claimed 3.3 GB/sec. Although reading with 40 threads generally didn't influence the results much, in this case we did reach the mentioned 3.3 GB/sec consistently at a smaller block-size (128kB) when using 40 threads. In figure 22 we show the results obtained with 16 SSDs over 2 RAID controllers. However, the exact same read performance was also observed when using 8, 10, 12 and 14 SSDs over 2 RAID controllers, meaning read performance saturates quite rapidly and increasing the number of SSDs beyond a certain threshold doesn't further improve it.

RAID05 lvm2

fig 24. random read, write bandwidth in [MB/s] v.s. block-size and threads
2x4xRAID5 (8 SSDs total)

bw-random-raid05-81.png

fig 25. random IOps v.s. block-size and threads
2x4xRAID5 (8 SSDs total)

iops-random-raid05-8.png

fig 26. random read, write bandwidth in [MB/s] v.s. block-size and threads
2x6xRAID5 (12 SSDs total)

bw-random-raid05-12.png

fig 27. random IOps v.s. block-size and threads
2x6xRAID5 (12 SSDs total)

iops-random-raid05-12.png

fig 28. random read, write bandwidth in [MB/s] v.s. block-size and threads
2x8xRAID5 (16 SSDs total)

bw-random-raid05-16.png

fig 29. random IOps v.s. block-size and threads
2x8xRAID5 (16 SSDs total)

iops-random-raid05-161.png

Figures 24 to 29 clearly show the awkward ‘gaps’ that we encountered multiple times during testing. The gaps shown here are of relatively minor severity (we saw ones that were much worse). As of now we have no explanation for these gaps and attribute them to compatibility problems between the Mtron SATA disks and the Areca SATA implementation (via the Intel IOP348 SATA stack).

RAID 5 – closer look at write performance

fig 30. random write only bandwidth in [MB/s] v.s. block-size and number of disks

bw-random-raid5-multiple.png

Figure 30 zooms in on the write performance we measured for several RAID 5 configurations, using 10 threads. The general trend is clearly that more disks improve write performance, a trend that was not so clearly visible for read performance. However, the graph shown in figure 30 is also a cause for great concern. There are a couple of awkward gaps to be seen, and especially the 2×6 configuration is problematic: the first few and last few measuring points of the 2×6 configuration are exactly on par with the 2×8 one, but in the entire mid-section the performance collapses.

Intended production setup: lvm2 – using 2 x RAID5 – 5 + 1 hot spare SSDs

In the above-mentioned tests we found that software striping outperformed any configuration on a single RAID controller. Therefore the intention is to configure a software RAID0 setup using lvm2 for the performance improvement. We have chosen a hardware RAID5 setup of 5 SSDs plus 1 hot spare per controller; this gives us much better performance than RAID6 while still offering redundancy. In case one SSD fails, the RAID will rebuild itself using the hot spare in about 15 minutes. During this window we are vulnerable to data loss, but that is a rather small and acceptable risk.

The performance measured for this setup sits somewhere in between the 2×4 and 2×6 results shown in figures 24 and 26 (bandwidth) and figures 25 and 27 (IOPS). We will spare you another graph. What about the sequential read and write performance? We measured this using a simple dd command, with a big file (69GB) to rule out the mere 8GB of RAM cache and a small file (4.3GB) that only hits the cache. In practical situations inserts and updates on the database will only hit the RAM cache, and reading from the database will be a mixture of reading from the 48GB of OS cache, the 8GB of on-board cache on the Arecas and finally the SSDs.

The big file (69GB):
dd if=/dev/zero of=/ssd/file.txt bs=512 x 2^[0..13] count=16k x 2^[13..0]
dd if=/ssd/file.txt of=/dev/zero bs=512 x 2^[0..13] 

The small file (4.3GB):
dd if=/dev/zero of=/ssd/file.txt bs=512 x 2^[0..13] count=1k x 2^[13..0]
dd if=/ssd/file.txt of=/dev/zero bs=512 x 2^[0..13] 
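The bs/count notation above is shorthand for a sweep over doubling block sizes while keeping the total file size constant; a small shell sketch of such a sweep (paths are illustrative) could look like this:

# sweep block sizes from 512B to 4MiB; each time the block size doubles the count is halved,
# so the file stays at 16k x 8192 x 512B = 64GiB (~69GB)
count=$((16 * 1024 * 8192))
for exp in $(seq 0 13); do
    bs=$((512 * (1 << exp)))
    dd if=/dev/zero of=/ssd/file.txt bs=$bs count=$((count >> exp))
    dd if=/ssd/file.txt of=/dev/null bs=$bs
done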

fig 31. sequential read, write in [MB/s] v.s. block-size
bw-sustained-raid5-10.png

IOMeter

Coming soon

PostgreSQL performance

Coming soon

Test specification
Server; SuperMicro X7DWN+, 2 x X5460, 48GB RAM, 2 x ARC1680IX-12, 12 x MSP7535-032.
RAID controller ARC1680IX-12; 4GB DDR2 533MHz ECC, Firmware 1.46 23-01-2009, SAS 4.4.3.0, HDD Read Ahead = Auto, Cache = Enabled, Disk Write Cache Mode = Auto
RAID configuration; 2x RAID5 5 + 1 hot spare SSDs per controller – 128kB stripe size, Tagged Queuing = Enabled, Cache Mode =Write Back
SSD MTRON MSP 7535-032; Firmware 0.18R1H3
OS: Linux 2.6.26-1-amd64 [Debian Lenny]
Software RAID configuration: lvm2 with two RAID controllers – 128kB stripe size – READ AHEAD 8192

Conclusion

SSDs perform great, especially for database servers, where lots of concurrent read and write operations are carried out. Tests show an overall performance improvement of ten times for our database server, but a general figure cannot be given; it all depends on your file system usage.

When used wisely, SSDs are not much more expensive than traditional hard-disks (in the future probably cheaper, because you need fewer SSDs to outperform hard-disks) and they consume less energy.

However, since SSDs are a rather new technology, a lot of testing is required. It thus takes some time before one can actually go into production, and until we have solved, or at least understand, where the file system hiccups originate we will not go live with SSDs.


This benchmark was brought to you by Dennis Brouwer and Arjan Tijms of JDevelopment, an M4N team.

Building the new battleship Mtron

20 November 2008

A while back I stumbled upon the legendary article Battleship Mtron: the absurdly fast RAID array built with 9 Mtron SSDs on a blazingly fast Areca ARC-1231ML, sporting an amazing 800MHz Intel IOP341.

It was the fastest thing on the planet. Period.

A year has passed since then. At M4N we have been experimenting with an SSD setup consisting of 4 Mtron 7000 SSDs on a development server. After some extensive benchmarking, it appeared that in nearly all situations the IO power of these beasts is far superior to that of the traditional hard disk. A decision was made to order a bunch of SSDs to be placed in multiple servers. 12 of those arrived today, along with 2 Areca ARC-1680IX-12s equipped with a whopping 4GB of cache each and the fast 1.2GHz IOP348.

Seeing all that hardware together, however, made us think. What if we… assembled it all into one massive storage array? 12 Mtron 7500s on 2 Areca 1680s (6 per controller), combining the power of the RAID sets of both Arecas into a single volume using software RAID. What would the performance of that be?

Stay tuned for our upcoming benchmark reports!

mtrons
12 Mtrons arrived at the office.

Arjan
