
One DVD per second

20 February 2009


For modern database systems the main bottleneck is usually I/O. In order to speed up our database we conducted a study to find the fastest I/O system currently available. In this article we describe our approach and findings. The result: our new database server, filled with SSDs, can transfer the equivalent of nearly one DVD per second, at 3.3GB/sec.
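To put the headline claim in perspective, here is the back-of-the-envelope arithmetic (taking 4.7 GB for a single-layer DVD, our assumption):

```shell
# Quick sanity check: 3.3 GB/s against a 4.7 GB single-layer DVD.
awk 'BEGIN { rate = 3.3; dvd = 4.7; printf "%.2f DVDs per second, one DVD every %.1f seconds\n", rate/dvd, dvd/rate }'
```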


Our current SAS hard-disk based database server (6 disks, RAID10) suffered from an access-time bottleneck: too much concurrent database access slowed the server down considerably. In a previous article, building-the-new-battleship-mtron, we promised test results using our new hardware.

We can now finally show you some of the test results and give you a taste of what SSD-based file systems will bring. Of course, we ran into several firmware incompatibility issues with this emerging hardware, found motherboard PCI-E bottlenecks, discovered Linux kernel performance differences, software-specific benchmarking problems and strange file system hiccups, but most of these have been circumvented to produce the figures listed in this article.

Benchmarking was not a goal in itself. Given the amount of hardware (which reflects a certain amount of hard cash), what would be the best performing, cost-effective and fail-safe setup using this hardware?

Well, we think we figured that out. Where to start… first let us explain why we chose this hardware setup in the first place. And for those who want to jump to conclusions, be our guest.

Hardware, the outfit

SSD: they come in several flavors, and the best tasting one is built with SLC memory: it lasts longer and generally performs better on writes. The downside: SLC drives are much more expensive. We already had some hands-on experience with MTRON SSDs, so we chose the newer, better and faster MTRON 7500 Pro series. Of course there are Intel, MemoRight, OCZ and Transcend drives too, but at the time of writing MTRON was the best performer on paper. We ordered 12 MTRON MSP7535-032 (32GB) and 4 extra MTRON MSP7535-064 (64GB) SSDs.

RAID controller: this had to be a PCI-E x8 board to handle all the I/O. Unfortunately, at the time of writing no PCI-E x16 boards were available (and few, if any, server motherboards support PCI-E x16 slots). As much cache as possible should be on board (only useful with a Battery Backup Unit applied). And, this may sound odd, but specifically for us, the controller also needed to support SAS disks: we wanted to re-deploy some of our SAS hard disks for good old time's sake. The logical choice was the ARC1680IX-12. Not only had we already gained some experience on a test server with its smaller brother, the ARC1680IX-8, but the recent firmware release 1.45 put it nearly on par with the native SATA controllers of the same brand. Problematic for us is that all current high-performance RAID controllers have been designed with traditional hard disks in mind, which means the RAID controller itself often becomes the bottleneck. To overcome this bottleneck we used not one, but two ARC1680IX-12s in parallel. The setup we used is depicted in figure A.

fig A. RAID05 dual controller setup

Server: the selected motherboard is the Super Micro X7DWN+, which can hold up to 128GB of RAM and two Quad Core Xeon CPUs. We equipped the board with two X5460 CPUs (at that time the highest-clocked Xeon, which comes in handy for CPU-bound calculations) and a mere 48GB of RAM. Since the server will act as a database server, the more RAM the better: most of the data can then be held in memory.

Casing: a couple of OS disks, legacy SAS hard disks and 12 SSDs need a shelter. We chose the Super Micro SC846, a 4U server casing that nicely matches our two ARC1680IX-12 controllers and comes with a redundant power supply, something that is truly enterprise-worthy.

Firmware, the hurdles

The first thing we did was install the controller cards, wire the backplane, insert an ordinary hard disk for the OS, install Debian on it and insert all SSDs in random order into the cabinet. Then the reboot, and… show time!

No, not yet: the event log of the Areca controllers showed many Time-Out errors. Doing regular things like building RAID arrays seemed impossible, let alone creating a file system on the newly created block devices.

Research revealed that the hardware came with different firmware than we had used before. We found out that the 64GB SSDs were shipped with firmware 0.18R1H3, the same as in our test server, while the 32GB SSDs came with 0.19R1, which was new to us. The drives with firmware 0.18R1H3 did indeed work properly. So we first asked our supplier to ship us the newer firmware 0.19R1H2, available at that time (sometimes newer is better). We flashed the disks to 0.19R1H2, but still no good; a lot of Time-Out errors appeared in the event log again. Then we figured that sometimes older is better and asked MTRON whether it was possible to downgrade the firmware; luckily it was. We were provided with the DOWN.EXE executable and the proper MTRON.MFI file, and we flashed the SSDs back to 0.18R1H3. Bingo… all seemed to work fine now. We recently tried to upgrade some SSDs to firmware 0.20R1, but that didn't work either.

Not only MTRON releases firmware on a regular basis; so does Areca. We therefore upgraded the ARC1680IX-12s to the 1.46 firmware released on 23-01-2009 and performed the whole ritual again, flashing the SSDs to the latest version 0.20R1. At first glance no Time-Out errors were noticed after the flash, but building RAID arrays slowed down nearly ten times and a file system created on the resulting block device performed over 20 times worse than expected. We are back to 0.18R1H3 again. Currently Areca and MTRON are investigating in a joint effort what causes these problems. Some hurdles have been taken, but we are not finished yet; we will keep you posted. The latest news is that Areca was able to reproduce the problems, but located them within Intel's IOP348 firmware. Of course Areca is unable to fix anything there, so it is now Intel's turn.

Motherboard, keeping the right PCI-E x8 lane

The motherboard comes with plenty of PCI-E x8 slots, and we installed the RAID controllers in the slots that would give the best airflow in the casing. Since we have two controllers, we created identical arrays on both controllers (let's call them "legs"). Tests showed a read performance difference of some 25% between those legs. Swapping the controller cards, and even the SSDs, showed that the cards were performing fine and the SSDs were not to blame either. Looking in the manual, we discovered that the PCI-E x8 slots we had tried are driven by different chipsets. Our board comes with an Intel 631xESB/632xESB I/O controller chip (south bridge) and the 5400MCH (north bridge), and the 631xESB/632xESB I/O controller performs worse. If you are putting any piece of hardware in your server, be aware of, or better, test, which slot to take.

Put into figures: using the 5400MCH we were able to do multi-threaded sequential and random READs with a maximum total bandwidth of 1750MB/s, while the 631xESB/632xESB I/O controller stopped at 1400MB/s. WRITING was not affected by the slot choice, since it is bottlenecked by the disks themselves.
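Besides the manual, the link a card actually negotiated can also be checked from Linux (the PCI address below is a placeholder; substitute your controller's address from plain `lspci`):

```shell
# Show the maximum (LnkCap) and negotiated (LnkSta) PCI-E link width/speed
# for a given device. An x8 card running at a narrower width shows up here.
lspci -vv -s 0a:00.0 | grep -E 'LnkCap|LnkSta'
```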

Benchmarking, the software as referee

This may be considered one of the most important aspects of deriving figures from a server: which benchmark software to use? Ideally, software for which a lot of results are already available for comparison. We evaluated bonnie++, IOMeter, IOZone, bm-flash, Orion and some more, all with their own pros and cons.

We eventually settled on bm-flash and IOMeter. bm-flash takes a fixed amount of time to finish and gives a quick impression of file system performance, while IOMeter with a default workload is widely used, so we have an abundance of figures for comparison. We dropped bonnie++ because under some conditions it only shows "++++" when no useful figures could be calculated. IOZone and Orion would require a complete study to interpret.

In order to use IOMeter we had to drop Linux and install Windows Server 2008. Dynamo (the test client) running on Linux and the IOMeter GUI (manager) running on Windows are somehow incompatible and continuously caused crashes on our test server. For our escape to Windows Server 2008 we were stuck with NTFS.

For benchmarking Linux we chose the default JFS file system, because previous tests showed that (in our case) it had the least CPU overhead and performed better than ext3. One may get totally lost in file system tuning, but we decided it was not feasible to tune too many parameters at a time. Note that the mount options were set to minimize writes to the disks:
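The original option listing did not survive into this post; a typical write-minimizing JFS mount (option names and the device path are our reconstruction, not the original listing) would look like:

```shell
# Hypothetical reconstruction: noatime suppresses the access-time update that
# every read would otherwise write back to disk; nodiratime does the same for
# directories (implied by noatime on recent kernels, spelled out for clarity).
mount -t jfs -o noatime,nodiratime /dev/dbvg/dblv /ssd
```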


Linux software striping, the odd and even lanes

In order to increase the bandwidth we wanted to make use of software striping, with the side effect of doubling the amount of RAM cache. We created two identical legs on the controllers, plugged into equally performing lanes, and installed the mdadm and lvm2 packages for Debian (Debian Lenny – Linux 2.6.26-1-amd64). With both mdadm and lvm2 we configured striped block devices. Let the games begin…

Not to bother you with too many figures right now, but we made an interesting discovery: block devices created by mdadm performed about 10% better than those created by lvm2. Write bandwidth is quite similar, but the number of IOPS and the read bandwidth using mdadm are higher. A survey on the internet showed that more people have complained about this performance difference, and that file system alignment and read-ahead settings should be properly set. But even when everything is properly configured the difference remains. For example, we can do at most 65,000 IOPS using mdadm and only 59,000 IOPS with lvm2; we can read (sequential, single threaded) at most 1050MB/s using mdadm and only 980MB/s with lvm2. These measurements were obtained with reasonably fresh SSDs. Later on, performance degraded and there was seemingly no way to get the original performance back. The mdadm/lvm2 performance difference can be attributed to the Linux kernel: mdadm uses the md kernel module while lvm2 uses the dm kernel module, and these are beyond our control. Because of system administrator support we will not use mdadm but will go with lvm2.
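For reference, the two striped block devices were created along these lines (a sketch: /dev/sda and /dev/sdb stand in for the block devices exported by the two controller legs, and the 128kB stripe size matches the test specification at the end of this article):

```shell
# mdadm: RAID0 across the two controller legs (md kernel module); --chunk in kB
mdadm --create /dev/md0 --level=0 --raid-devices=2 --chunk=128 /dev/sda /dev/sdb

# lvm2 equivalent (dm kernel module): -i = number of stripes, -I = stripe size in kB
pvcreate /dev/sda /dev/sdb
vgcreate ssdvg /dev/sda /dev/sdb
lvcreate -i 2 -I 128 -l 100%FREE -n ssdlv ssdvg
```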

Note: we recently retested on Debian Lenny, now stable. The maximum read bandwidth (sequential, single threaded) for both mdadm and lvm2 stopped at 990MB/s. The difference in IOPS remained (65,000 mdadm vs. 60,000 lvm2) but was only noticeable for small block sizes (512B – 1kB) and is therefore of minor importance to us.

Hiccups, the file system's full stop

During our test runs we sometimes noticed that the performance of the system was below expectations. A typical test run using bm-flash looks like this [RAID0, 12 SSDs attached to one controller]:

test:/benchmark# ./bm-flash /ssd/test.txt 

Filling 4G before testing  ...   4096 MB done in 3 seconds (1365 MB/sec).

Read Tests:

Block |   1 thread    |  10 threads   |  40 threads   
 Size |  IOPS    BW   |  IOPS    BW   |  IOPS    BW   
      |               |               |               
 512B | 11307    5.5M | 63751   31.1M | 63312   30.9M 
   1K | 22720   22.1M | 63717   62.2M | 63086   61.6M 
   2K | 24788   48.4M | 63510  124.0M | 62734  122.5M 
   4K | 29158  113.8M | 63088  246.4M | 62361  243.5M 
   8K | 30170  235.7M | 61824  483.0M | 61089  477.2M 
  16K | 26110  407.9M | 61337  958.3M | 60991  952.9M 
  32K | 19936  623.0M | 51250 1601.5M | 50853 1589.1M 
  64K | 13589  849.3M | 27716 1732.2M | 27717 1732.3M 
 128K |  8273 1034.2M | 13877 1734.7M | 13881 1735.1M 
 256K |  4828 1207.2M |  6942 1735.5M |  6947 1736.8M 
 512K |  2563 1281.5M |  3472 1736.1M |  3476 1738.4M 
   1M |  1447 1447.0M |  1736 1736.6M |  1740 1740.2M 
   2M |   796 1592.3M |   866 1732.5M |   868 1737.1M 
   4M |   407 1630.0M |   430 1720.7M |   437 1748.3M 

Write Tests:

Block |   1 thread    |  10 threads   |  40 threads   
 Size |  IOPS    BW   |  IOPS    BW   |  IOPS    BW   
      |               |               |               
 512B | 28529   13.9M | 29048   14.1M | 28567   13.9M 
   1K | 27759   27.1M | 28213   27.5M | 28053   27.3M 
   2K | 27393   53.5M | 28036   54.7M | 27513   53.7M 
   4K | 26289  102.6M | 25278   98.7M | 26446  103.3M 
   8K | 23068  180.2M | 22138  172.9M | 21824  170.5M 
  16K | 17426  272.2M | 16820  262.8M | 17404  271.9M 
  32K | 10607  331.4M | 10988  343.3M | 11173  349.1M 
  64K |  6208  388.0M |  6894  430.9M |  6941  433.8M 
 128K |  3611  451.4M |  3895  486.9M |  4019  502.4M 
 256K |  1999  499.9M |  2116  529.0M |  2060  515.2M 
 512K |   834  417.4M |  1074  537.3M |  1024  512.2M 
   1M |   412  412.6M |   596  596.6M |   561  561.6M 
   2M |   280  561.3M |   240  480.0M |   207  414.7M 
   4M |   153  615.1M |   137  551.5M |   119  478.7M

However, with a different RAID configuration we noticed "gaps" in performance; the output would typically look like this [RAID5, 4 SSDs attached to one controller]:

test:/benchmark# ./bm-flash /ssd/test.txt 

Filling 4G before testing  ...   4096 MB done in 3 seconds (1365 MB/sec).

Read Tests:

Block |   1 thread    |  10 threads   |  40 threads
 Size |  IOPS    BW   |  IOPS    BW   |  IOPS    BW
      |               |               |
 512B |    24   12.2K | 28061   13.7M | 63593   31.0M
   1K | 22599   22.0M | 63882   62.3M | 63582   62.0M
   2K | 24644   48.1M | 63869  124.7M | 63275  123.5M
   4K | 29294  114.4M | 63519  248.1M | 62923  245.7M
   8K | 30188  235.8M | 62176  485.7M | 61474  480.2M
  16K | 26207  409.4M | 61542  961.5M | 61231  956.7M
  32K | 19944  623.2M | 51450 1607.8M | 50947 1592.1M
  64K | 13661  853.8M | 27717 1732.3M | 27718 1732.4M
 128K |  8356 1044.5M | 13878 1734.7M | 13881 1735.2M
 256K |  4756 1189.0M |  6942 1735.6M |  6947 1736.9M
 512K |  2563 1281.8M |  3473 1736.5M |  3476 1738.4M
   1M |  1383 1383.0M |  1737 1737.0M |  1739 1739.7M
   2M |   795 1590.1M |   852 1704.1M |   866 1733.0M
   4M |   408 1632.0M |   430 1720.3M |   436 1744.0M

Write Tests:

Block |   1 thread    |  10 threads   |  40 threads
 Size |  IOPS    BW   |  IOPS    BW   |  IOPS    BW
      |               |               |
 512B |  5642    2.7M |  7221    3.5M |  9223    4.5M
   1K |  7851    7.6M |  5218    5.0M |  5815    5.6M
   2K |  4142    8.0M |  4357    8.5M |  3976    7.7M
   4K |  3314   12.9M |  3379   13.2M |  2890   11.2M
   8K |  2251   17.5M |  2018   15.7M |  2500   19.5M
  16K |  1980   30.9M |  2659   41.5M |  3499   54.6M
  32K |  1303   40.7M |  1217   38.0M |  1342   41.9M
  64K |   240   15.0M |   144    9.0M |   126    7.9M
 128K |   499   62.3M |    96   12.0M |   110   13.8M
 256K |   309   77.2M |    13    3.2M
 512K |     6    3.4M
   1M |   154  154.0M |    10   10.5M
   2M |    80  160.3M |    99  198.7M |    99  199.1M
   4M |    42  170.7M |    54  218.7M |    57  230.0M

Notice the drop in performance that starts when writing 64kB blocks.

Tests with IOMeter showed that under certain conditions the whole controller became unresponsive for several seconds; we even measured file system freezes in the order of 10 seconds (both reading and writing). We are still puzzled about what causes these freezes, but we assume that after heavy writing, with the controller's cache completely filled up, the controller first writes some cached data to disk and only then starts working properly again. We have asked Areca what the intended behavior should be. We will keep you posted.

Figures, the benchmark results

Those interested in hard figures are probably looking at the graphs already. We deliberately do not show all the test results; we graphed them where possible to give an overview of SSD performance in different RAID configurations.

In short: we have 16 SSDs and two controllers that support a maximum of 12 SSDs each. We tested RAID0 and RAID5 but skipped RAID6 and RAID10, as write performance on these was rather poor. We repeatedly tested 1 to 12 SSDs in RAID0 and 3 to 12 SSDs in RAID5 configuration. For lvm2 striped block devices (using the two controllers with software RAID0) we repeatedly tested 2×4 to 2×8 SSDs per RAID0 and RAID5 configuration.

The 8kB block-size figures are emphasized because our database implementation is compiled to work with an 8kB block size.

Sequential read, write and file copy versus number of SSDs and RAID configuration

dd if=/dev/zero of=/ssd/file.txt bs=8K count=5M [write 40GB file]
dd if=/ssd/file.txt of=/dev/zero [read 40GB file]
cp /ssd/file.txt /ssd/copy-of-file.txt [copy 40GB file]

fig 1. sequential read, write bandwidth [MB/s] v.s. SSDs and RAID configuration
8kB block-size


fig 2. 40GB file copy [seconds] v.s. SSDs and RAID configuration (lower is better)
Conclusion: starting from 6 SSDs the controller's read bandwidth saturates, while writing may benefit from additional SSDs. lvm2 software RAID0 over two controllers with an equal number of SSDs (8, 10, 12) always outperforms a single controller.

Random read, write bandwidth versus number of SSDs and RAID configuration

bm-flash /ssd/file.txt [random read, write and IOPS v.s block-size and threads]

fig 3. random read, write bandwidth in [MB/s] v.s. SSDs and RAID configuration
8kB block-size, 1 thread


fig 4. random read, write bandwidth in [MB/s] v.s. SSDs and RAID configuration
8kB block-size, 10 threads


fig 5. random read, write bandwidth in [MB/s] v.s. SSDs and RAID configuration
8kB block-size, 40 threads


Conclusion: bandwidth seems to be limited by the number of I/O requests that a single thread (process) can spawn. Scaling the number of threads from 10 to 40 hardly influences the total bandwidth. lvm2 software RAID0 carries a slight performance penalty on random reads.

Random read, write bandwidth and IOPS versus block size and RAID configuration

bm-flash /ssd/file.txt [random read, write and IOPS v.s block-size and threads]

Note: since we concluded that 10 versus 40 threads makes little difference, for readability we only show the 1- and 10-thread random read/write bandwidth and IOPS graphs.


fig 6. random read, write bandwidth in [MB/s] v.s. block-size and threads


fig 7. random IOps v.s. block-size and threads


fig 8. random read, write bandwidth in [MB/s] v.s. block-size and threads


fig 9. random IOps v.s. block-size and threads


fig 10. random read, write bandwidth in [MB/s] v.s. block-size and threads


fig 11. random IOps v.s. block-size and threads



fig 12. random read, write bandwidth in [MB/s] v.s. block-size and threads


fig 13. random IOps v.s. block-size and threads


fig 14. random read, write bandwidth in [MB/s] v.s. block-size and threads


fig 15. random IOps v.s. block-size and threads


fig 16. random read, write bandwidth in [MB/s] v.s. block-size and threads


fig 17. random IOps v.s. block-size and threads


RAID00 lvm2

fig 18. random read, write bandwidth in [MB/s] v.s. block-size and threads
2x4xRAID0 (8 SSDs total)


fig 19. random IOps v.s. block-size and threads
2x4xRAID0 (8 SSDs total)


fig 20. random read, write bandwidth in [MB/s] v.s. block-size and threads
2x6xRAID0 (12 SSDs total)


fig 21. random IOps v.s. block-size and threads
2x6xRAID0 (12 SSDs total)


fig 22. random read, write bandwidth in [MB/s] v.s. block-size and threads
2x8xRAID0 (16 SSDs total)


fig 23. random IOps v.s. block-size and threads
2x8xRAID0 (16 SSDs total)


As can be seen in figures 18, 20 and 22, from a 256kB block size onward we finally reach our claimed 3.3 GB/sec. Although reading with 40 threads generally didn't influence the results much, in this case we reached the mentioned 3.3 GB/sec consistently at a smaller block size (128kB) when using 40 threads. Figure 22 shows the results obtained with 16 SSDs over 2 RAID controllers; however, the exact same read performance was also observed with 8, 10, 12 and 14 SSDs over 2 RAID controllers, meaning read performance saturates quite rapidly, and increasing the number of SSDs beyond a certain threshold does not further improve it.

RAID05 lvm2

fig 24. random read, write bandwidth in [MB/s] v.s. block-size and threads
2x4xRAID5 (8 SSDs total)


fig 25. random IOps v.s. block-size and threads
2x4xRAID5 (8 SSDs total)


fig 26. random read, write bandwidth in [MB/s] v.s. block-size and threads
2x6xRAID5 (12 SSDs total)


fig 27. random IOps v.s. block-size and threads
2x6xRAID5 (12 SSDs total)


fig 28. random read, write bandwidth in [MB/s] v.s. block-size and threads
2x8xRAID5 (16 SSDs total)


fig 29. random IOps v.s. block-size and threads
2x8xRAID5 (16 SSDs total)


Figures 24 to 29 clearly show the awkward 'gaps' we encountered multiple times during testing. The gaps shown here are of relatively minor severity (we saw much worse ones). As of now we have no explanation for this, and we attribute it to compatibility problems between the MTRON SATA disks and the Areca SATA implementation (via the Intel IOP348 SATA stack).

RAID 5 – closer look at write performance

fig 30. random write only bandwidth in [MB/s] v.s. block-size and number of disks


Figure 30 zooms in on the write performance we measured for several RAID5 configurations, using 10 threads. The general trend is clearly that more disks improve write performance, a trend that was not so clearly visible for reads. However, the graph in figure 30 is also a cause for great concern. There are a couple of awkward gaps, and the 2×6 configuration in particular is problematic: its first few and last few measuring points are exactly on par with the 2×8 configuration, but in the entire mid-section all performance is gone.

Intended production setup: lvm2 – using 2 x RAID5 – 5 + 1 hot spare SSDs

In the tests above we found that software striping outperformed any configuration on a single RAID controller. We therefore intend to configure a software RAID0 setup using lvm2 for the performance improvement. We have chosen a hardware RAID5 setup of 5 SSDs plus 1 hot spare per controller; this gives much better performance than RAID6 and still offers redundancy. If one SSD fails, the RAID rebuilds itself onto the hot spare in about 15 minutes. During this window we are vulnerable to data loss, but that is a rather small and acceptable risk.
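A sketch of the intended layout, assuming each controller exports its RAID5 set as a single block device (/dev/sda and /dev/sdb are placeholders; the 128kB stripe size and the 8192-sector read-ahead come from the test specification at the end of this article):

```shell
# Each controller: hardware RAID5 over 5 SSDs + 1 hot spare, configured in the
# Areca firmware; Linux then sees one block device per leg.
pvcreate /dev/sda /dev/sdb
vgcreate dbvg /dev/sda /dev/sdb
lvcreate -i 2 -I 128 -l 100%FREE -n dblv dbvg   # software RAID0 over the two legs
blockdev --setra 8192 /dev/dbvg/dblv            # read-ahead, in 512-byte sectors
mkfs.jfs -q /dev/dbvg/dblv                      # JFS, as used in the benchmarks
```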

The performance measured for this setup sits somewhere between that shown in figures 24 and 26 (bandwidth) and figures 25 and 27 (IOPS); we will spare you another graph. What about sequential read and write performance? We measured this using a simple dd command, with a big file (69GB) to rule out the mere 8GB of RAM cache, and a small file (4.3GB) that only hits the cache. In practical situations, inserts and updates on the database will only hit the RAM cache, and reading from the database will be a mixture of reading from the 48GB of OS cache, the 8GB of onboard cache on the Arecas, and finally the SSDs.

The big file (69GB):
dd if=/dev/zero of=/ssd/file.txt bs=512 x 2^[0..13] count=16k x 2^[13..0]
dd if=/ssd/file.txt of=/dev/zero bs=512 x 2^[0..13] 

The small file (4.3GB):
dd if=/dev/zero of=/ssd/file.txt bs=512 x 2^[0..13] count=1k x 2^[13..0]
dd if=/ssd/file.txt of=/dev/zero bs=512 x 2^[0..13] 
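The bs/count notation above is shorthand for a sweep in which the block size doubles while the count halves, so the total file size stays constant. A runnable version (defaults scaled down for illustration: TARGET and BASE_COUNT are ours, the article used /ssd/file.txt with BASE_COUNT=16384 for the 69GB file and 1024 for the 4.3GB one; we read back to /dev/null rather than /dev/zero):

```shell
TARGET=${TARGET:-/tmp/file.txt}
BASE_COUNT=${BASE_COUNT:-1}
for i in $(seq 0 13); do
  bs=$((512 * (1 << i)))                   # 512 B up to 4 MB
  count=$((BASE_COUNT * (1 << (13 - i))))  # halves as bs doubles: constant total size
  dd if=/dev/zero of="$TARGET" bs="$bs" count="$count" conv=notrunc 2>/dev/null
  dd if="$TARGET" of=/dev/null bs="$bs" 2>/dev/null
done
```

With the defaults above, every iteration rewrites and rereads the same 4MiB file; raising BASE_COUNT scales the file without touching the loop.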

fig 31. sequential read, write in [MB/s] v.s. block-size


Coming soon

PostgreSQL performance

Coming soon

Test specification
Server; SuperMicro X7DWN+, 2 x X5460, 48GB RAM, 2 x ARC1680IX-12, 12 x MSP7535-032.
RAID controller ARC1680IX-12; 4GB DDR2 533MHz ECC, Firmware 1.46 23-01-2009, SAS, HDD Read Ahead = Auto, Cache = Enabled, Disk Write Cache Mode = Auto
RAID configuration; 2x RAID5 5 + 1 hot spare SSDs per controller – 128kB stripe size, Tagged Queuing = Enabled, Cache Mode =Write Back
SSD MTRON MSP 7535-032; Firmware 0.18R1H3
OS: Linux 2.6.26-1-amd64 [Debian Lenny]
Software RAID configuration: lvm2 with two RAID controllers – 128kB stripe size – READ AHEAD 8192


SSDs perform great, especially for database servers, where lots of concurrent read and write operations are carried out. Tests show an overall performance improvement of ten times for our database server, but a general performance improvement figure cannot be given; it all depends on your file system usage.

When used wisely, SSDs are not much more expensive than traditional hard disks (in the future probably cheaper, because you need fewer SSDs to outperform hard disks) and they consume less energy.

However, since SSDs are a rather new technology, a lot of testing is required. It thus takes some time before one can actually go into production, and until we have solved, or at least understand, where the file system hiccups originate, we will not go live with SSDs.

This benchmark was brought to you by Dennis Brouwer and Arjan Tijms of JDevelopment, an M4N team.
