Archive for the ‘Hardware’ Category

Review: world’s first Supermicro 2026TT chassis

10 March 2010

Just after our house style was redesigned in blue, we technicians go green. Our intention was to modernize our server park with exclusively 2.5″ chassis, and we had long been waiting for a 2U, 2.5″ chassis that would accommodate a lot of SSDs. We are already preparing a new database server (fitted with 24 x 2.5″ INTEL X25-E SSDs) based on the Supermicro SC826 chassis, but our XEN cluster would love something similar too!

The wait is over: we were shipped the world’s very first Supermicro 2026TT-HIBXRF 2U server. While the blades themselves more or less existed before, the specialty of this configuration is that all 6 SATA channels are connected to the backplane. This enables us to use all SATA ports for LVM setups, using SSDs for the sake of performance and power savings.

front1

front2

swap bays and switching panel details

rear1

rear connection details

The goodies we placed in each blade:

  • CPU: 2 x Xeon X5570 (Nehalem, 4 cores, 8 threads, 2.93 GHz, 95 W)
  • RAM: 6 x Kingston KVR1333D3D4R9S/4G
  • HDD: 1 x Seagate ST9500530NS 500GB SATA
  • SSD: 4 x INTEL X25-M Postville SSDSA2MH160G2C1 (@FW 02HD)

On the HDD we installed Debian Lenny with the Xen 3.2-1-amd64 hypervisor; Dom0 and DomU are running 2.6.26-2-xen-amd64 kernels, and the virtual machines will run on LVM volumes created from the 4 SSDs. Since the INTEL SSDs are based on MLC cells, we are taking a risk with potentially intensive writing; however, testing on our workstations showed that the life expectancy of this setup will be a couple of years. Manufacturers are planning longer-living MLC SSDs for the end of this year, so replacements will be at hand pretty soon.

bios

What about power usage of such a server? I measured it quickly with no optimizations in the Linux kernel and with the Hyper-Threading option switched off in the BIOS:

  • STANDBY: 30 Watt
  • 1 BLADE: 210 Watt
  • 2 BLADES: 347 Watt
  • 3 BLADES: 499 Watt
  • 4 BLADES: 647 Watt

This comes down to ~150 Watt per blade (idle), and ~50 Watt for 4 x HDD, 16 x SSD, and some case fans – pretty cool, isn’t it? What this setup will do under high load will be determined later; for now, the cooling conditions in our test room were far from optimal: with an ambient temperature of 27 degrees Celsius, the temperature within the chassis rose to 53 degrees. We will have to wait for stress testing until the server is at its final destination, with much better cooling conditions.
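The per-blade figure follows directly from the measured totals; a quick sketch (the helper name is ours, and the result includes each blade's share of the drives and fans):

```shell
# Average marginal power per blade, from the wattages measured above.
avg_blade_watts() {
  standby=$1   # chassis standby draw in Watt
  full=$2      # total draw with all 4 blades idling
  echo $(( (full - standby) / 4 ))
}
avg_blade_watts 30 647   # prints 154, i.e. roughly the ~150 W quoted above
```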

What about XEN, LVM and SSD performance? We are not done yet, but measurements in Dom0 showed pretty nice figures. Since a hardware RAID solution using these twin blades is more or less off limits, software RAID is the best alternative. From experience we know that the XFS file system performs best in our benchmarks (with the schedulers set to deadline). After some sweet-spot measurements with lvm2 across the 4 SSDs (RAID 0), we figured out that the following settings work best:

pvcreate --metadatasize 511K /dev/sdb /dev/sdc /dev/sdd /dev/sde
vgcreate xenvg-ssd /dev/sdb /dev/sdc /dev/sdd /dev/sde 
lvcreate -i4 -I256 -L40G -n benchmark xenvg-ssd
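Since we put XFS on these volumes, it may help to pass the stripe geometry along when formatting; a sketch matching the settings above (su/sw are the standard mkfs.xfs stripe options, and the mount point is illustrative):

```shell
# Tell XFS about the LVM stripe layout: 256 KiB stripe unit, 4-way stripe.
mkfs.xfs -d su=256k,sw=4 /dev/xenvg-ssd/benchmark
mount -o noatime /dev/xenvg-ssd/benchmark /bench
```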

Figures for this setup were not obtained with IOZone, IOMeter or similar; instead we used our own tools that do the trick. For more information on this, please see: http://jdevelopment.nl/hardware/one-dvd-per-second/:

bm-flash:

Filling 4G before testing  ...   4096 MB done in 12 seconds (341 MB/sec).

Read Tests:

Block |   1 thread    |  10 threads   |  40 threads
 Size |  IOPS    BW   |  IOPS    BW   |  IOPS    BW
      |               |               |
 512B |  8695    4.2M | 58401   28.5M |153774   75.0M
   1K |  7712    7.5M | 54920   53.6M |148026  144.5M
   2K |  6455   12.6M | 46069   89.9M |134606  262.9M
   4K |  4909   19.1M | 35301  137.8M |103674  404.9M
   8K |  4516   35.2M | 32108  250.8M | 72833  569.0M
  16K |  3954   61.7M | 27518  429.9M | 43003  671.9M
  32K |  3262  101.9M | 19297  603.0M | 22875  714.8M
  64K |  2376  148.5M | 11136  696.0M | 11750  734.3M
 128K |  1665  208.1M |  5880  735.1M |  5933  741.7M
 256K |  1001  250.4M |  2979  744.7M |  2973  743.4M
 512K |   841  420.7M |  1415  707.5M |  1422  711.2M
   1M |   533  533.5M |   619  619.0M |   621  621.0M
   2M |   280  560.0M |   307  615.5M |   309  619.3M
   4M |   143  574.3M |   153  614.7M |   151  606.3M

Write Tests:

Block |   1 thread    |  10 threads   |  40 threads
 Size |  IOPS    BW   |  IOPS    BW   |  IOPS    BW
      |               |               |
 512B | 11062    5.4M | 21375   10.4M | 26693   13.0M
   1K |  6834    6.6M | 15384   15.0M | 22303   21.7M
   2K |  6244   12.1M | 13582   26.5M | 23145   45.2M
   4K |  7473   29.1M | 18849   73.6M | 25007   97.6M
   8K |  7106   55.5M | 24629  192.4M | 31830  248.6M
  16K |  7254  113.3M | 18285  285.7M | 23884  373.1M
  32K |  4842  151.3M |  8619  269.3M | 11580  361.8M
  64K |  2525  157.8M |  4604  287.7M |  5943  371.4M
 128K |  1319  164.8M |  2377  297.2M |  3048  381.0M
 256K |   561  140.4M |  1244  311.0M |  1531  382.7M
 512K |   368  184.0M |   745  372.8M |   778  389.3M
   1M |   335  335.2M |   381  381.8M |   401  401.5M
   2M |   174  348.1M |   192  385.7M |   210  421.0M
   4M |    91  364.7M |   103  414.0M |   107  428.3M

xdd:

Random READ tests:

          |      1 Thread |    10 Threads |    40 Threads |
Blocksize |   IOPS   MB/s |   IOPS   MB/s |   IOPS   MB/s |
          |               |               |               |
      512 |  13639      6 | 120414     61 | 186044     95 |
     1024 |  14256     14 | 109734    112 | 181448    185 |
     2048 |  12669     25 |  95246    195 | 171345    350 |
     4096 |  10302     42 |  75704    310 | 132238    541 |
     8192 |   8591     70 |  55870    457 |  78980    647 |
    16384 |   7244    118 |  35797    586 |  43133    706 |
    32768 |   5786    189 |  21985    720 |  22711    744 |

Sequential READ tests:

          |      1 Thread |    10 Threads |    40 Threads |
Blocksize |   IOPS   MB/s |   IOPS   MB/s |   IOPS   MB/s |
          |               |               |               |
      512 |  35796     18 | 119992     61 | 178309     91 |
     1024 |  34838     35 | 113584    116 | 170864    174 |
     2048 |  28590     58 |  97803    200 | 173524    355 |
     4096 |  19967     81 |  72748    297 | 134078    549 |
     8192 |  14151    115 |  57131    468 |  79959    655 |
    16384 |   9276    151 |  38128    624 |  43480    712 |
    32768 |   4460    146 |  22309    731 |  22812    747 |

Random WRITE tests:

          |      1 Thread |    10 Threads |    40 Threads |
Blocksize |   IOPS   MB/s |   IOPS   MB/s |   IOPS   MB/s |
          |               |               |               |
      512 |  23271     11 |  33162     16 |  40870     20 |
     1024 |  16571     16 |  26695     27 |  37758     38 |
     2048 |  16747     34 |  25156     51 |  34664     70 |
     4096 |  14019     57 |  24817    101 |  29577    121 |
     8192 |  12817    104 |  25704    210 |  30310    248 |
    16384 |  11149    182 |  15612    255 |  23467    384 |
    32768 |   6613    216 |   8525    279 |  12281    402 |

Sequential WRITE tests:

          |      1 Thread |    10 Threads |    40 Threads |
Blocksize |   IOPS   MB/s |   IOPS   MB/s |   IOPS   MB/s |
          |               |               |               |
      512 |  29471     15 |  36580     18 |  41892     21 |
     1024 |  26631     27 |  35478     36 |  36696     37 |
     2048 |  23431     47 |  32128     65 |  39953     81 |
     4096 |  22747     93 |  33924    138 |  40566    166 |
     8192 |  19811    162 |  23773    194 |  38880    318 |
    16384 |  12436    203 |  16751    274 |  24396    399 |
    32768 |   7470    244 |   8978    294 |  13039    427 |

Random READ/WRITE [90/10] tests:

          |      1 Thread |    10 Threads |    40 Threads |
Blocksize |   IOPS   MB/s |   IOPS   MB/s |   IOPS   MB/s |
          |               |               |               |
      512 |  14961      7 |  57521     29 |  85189     43 |
     1024 |  12284     12 |  43737     44 |  73368     75 |
     2048 |   9762     19 |  33229     68 |  66863    136 |
     4096 |   7366     30 |  27530    112 |  58668    240 |
     8192 |   6298     51 |  25379    207 |  48998    401 |
    16384 |   5283     86 |  20828    341 |  29309    480 |
    32768 |   4019    131 |  15410    504 |  19318    633 |

Sequential READ/WRITE [90/10] tests:

          |      1 Thread |    10 Threads |    40 Threads |
Blocksize |   IOPS   MB/s |   IOPS   MB/s |   IOPS   MB/s |
          |               |               |               |
      512 |  14278      7 |  72588     37 |  88427     45 |
     1024 |  10767     11 |  49585     50 |  73693     75 |
     2048 |   9110     18 |  34068     69 |  72447    148 |
     4096 |   7592     31 |  27516    112 |  66647    272 |
     8192 |   6271     51 |  26221    214 |  53545    438 |
    16384 |   5512     90 |  22818    373 |  33087    542 |
    32768 |   4138    135 |  16400    537 |  20735    679 |

sequential:

# dd if=/dev/zero of=/bench/xdd/S1 bs=8K count=2M
2097152+0 records in
2097152+0 records out
17179869184 bytes (17 GB) copied, 34.8562 s, 493 MB/s
# dd of=/dev/zero if=/bench/xdd/S1 bs=8K
2097152+0 records in
2097152+0 records out
17179869184 bytes (17 GB) copied, 27.2719 s, 630 MB/s
#  time cp /bench/xdd/S1 /bench/xdd/S0
real	1m8.972s
user	0m0.468s
sys	0m13.049s
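As a rough cross-check (our arithmetic, not a new measurement): the copy pushes the 16 GiB file through the array twice, reading and writing at the same time:

```shell
# Throughput of the cp above: 17179869184 bytes in ~69 s, read and write combined.
bytes=17179869184
secs=69
echo $(( bytes / secs / 1000000 ))   # prints 248: ~248 MB/s read plus ~248 MB/s write
```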

More to come….

100K+ IOPS on semi-commodity hardware

1 June 2009

Abstract

A while ago we conducted a study to find the fastest IO subsystem that money can buy these days. The only restriction was that it should be built out of (semi-)commodity hardware. That meant extremely expensive solutions, where a sales representative needs to visit your office just to give a price indication, were off limits. We reported our findings here: One DVD per second

The 65K IOPS barrier

When we examined the graphs resulting from our previous study, it was pretty obvious that we were hitting a bottleneck somewhere. No matter the amount of hardware that we threw at the system, all graphs quickly saturated at around a 65K IOPS limit. This is most apparent in the following graphs, which we repeat here from the previous study:

fig 1. random IOps v.s. block-size and threads
RAID0 8 SSDs

iops-random-raid0-8.png

fig 2. random IOps v.s. block-size and threads
RAID0 12 SSDs

iops-random-raid0-12.png

fig 3. random IOps v.s. block-size and threads
2x8xRAID0 (16 SSDs total)

iops-random-raid00-16.png

In these examples, maxing out the number of SSDs per controller (figure 2) or doubling the entire storage system and striping the two together (figure 3) does not yield any benefit with respect to the maximum number of random IOPS. We do see that the setup depicted in figure 3 gives us some performance benefit, but basically it only pushes two more data points (64k and 128k block sizes) towards the imaginary 60~65K barrier.

Alignment, stripe size and data stripe width

For SSD performance, correct alignment is of the utmost importance. A misaligned SSD will seriously underperform. In our previous study we did our best to find a correct alignment. Since we were using LVM to do the striping, we set the metadata size to 250k. This is actually a little trick we found, and apparently others found out about it too:

LVM likes to allocate 192k for its header information, and 192k is not a multiple of 128k. So if you are creating file systems as logical volumes, and you want those volumes to be properly aligned, you have to tell LVM that it should reserve slightly more space for its meta-data, so that the physical extents that it allocates for its logical volumes are properly aligned. Unfortunately, the way this is done is slightly baroque:

# pvcreate --metadatasize 250k /dev/sdb2
Physical volume "/dev/sdb2" successfully created

Why 250k and not 256k? I can’t tell you — sometimes the LVM tools aren’t terribly intuitive. However, you can test to make sure that physical extents start at the proper offset by using:

See: Aligning filesystems to an SSD’s erase block size
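The check the quote alludes to can be done with pvs, which can report where the first physical extent starts (a sketch; the device name is illustrative):

```shell
# pe_start shows the offset of the first physical extent; for a properly
# aligned PV this should be a multiple of the erase block / stripe size.
pvs /dev/sdb2 -o +pe_start --units k
```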

When digging some more into this, we learned about the concept “data stripe width”, which is defined as:

data width = array stripe size * number of data disks in your array

E.g. for raid 6 it’s array stripe size * (number of disks in array -2) and for raid 5 it’s array stripe size * (number of disks in array -1). This is an important number when you are striping multiple smaller arrays (sub-arrays, or what we called ‘legs’) into a larger array. It gives you the amount of data that together occupies exactly 1 array stripe of each device that makes up the sub-array. It’s this number that you should work with when trying to align LVM.
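In shell arithmetic, the formula works out as follows (the helper name is ours):

```shell
# data stripe width = array stripe size * number of data disks;
# the parity disks (1 for RAID5, 2 for RAID6) carry no data and are subtracted.
data_width_kib() {
  level=$1; disks=$2; stripe_kib=$3
  case "$level" in
    0) echo $(( disks * stripe_kib )) ;;
    5) echo $(( (disks - 1) * stripe_kib )) ;;
    6) echo $(( (disks - 2) * stripe_kib )) ;;
  esac
}
data_width_kib 6 6 128   # one 6-disk RAID6 leg with 128 KiB stripes -> 512 (KiB)
```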

Stevecs explains this rather well:

pvcreate --metadatasize X /dev/sda /dev/sdb
# X = 511 or whatever it takes to start at 512KiB (with padding)

vgcreate ssd -s Y /dev/sda /dev/sdb
# Y here is your physical extent size, which you probably want to be a multiple of your data stripe width
# default is 4MiB and that's fine, as our data stripe widths here divide evenly into that for 8K (32KiB) or 128K (512KiB)

lvcreate -i2 -IZ -L447G -n ssd-striped ssd
# Z here should be a multiple of your data stripe width (32KiB or 512KiB)

See: xtremesystems

Applying this theory did help in some cases, but the 65k barrier hadn’t moved an inch.

The Linux IO scheduler

Something we overlooked in the previous study is the fact that Linux has a configurable IO scheduler. Currently available are the following four:

  1. Complete Fair Queueing Scheduler (CFQ)
  2. Anticipatory IO Scheduler (AS)
  3. Deadline Scheduler
  4. No-op Scheduler

For an in-depth discussion of each scheduler, see: 3. Schedulers / Elevators

The CFQ and AS schedulers especially are designed to optimize IO in such a way that seek latencies are avoided. The thing is, of course, that SSDs don’t have any notable seek delay, especially not relative to other blocks. It thus doesn’t matter at all whether you e.g. fetch blocks 1, 10 and 20 sequentially or totally at random. Setting the scheduler to noop, which doesn’t try to be smart and basically does nothing at all, improved overall performance again. However, there was STILL a clear saturation point in our performance graphs. The barrier had moved up a little, but it was still very much there.
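Switching schedulers is a runtime sysfs setting; a sketch (the device names are illustrative, and the change does not survive a reboot):

```shell
# Set the noop scheduler for each SSD behind the controller.
for dev in sdb sdc sdd sde; do
  echo noop > /sys/block/$dev/queue/scheduler
done
cat /sys/block/sdb/queue/scheduler   # the active scheduler is shown in brackets
```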

The file-system

At this point we were frantically wading through the C source code of the Areca driver, monitoring the interrupts and basically grasping at every straw within our reach. It was then that my co-worker Dennis noticed the following when examining the jfs module:

modinfo jfs
filename: /lib/modules/2.6.26-1-amd64/kernel/fs/jfs/jfs.ko
license: GPL
author: Steve Best/Dave Kleikamp/Barry Arndt, IBM
description: The Journaled Filesystem (JFS)
depends: nls_base
vermagic: 2.6.26-1-amd64 SMP mod_unload modversions
parm: nTxBlock:Number of transaction blocks (max:65536) (int)
parm: nTxLock:Number of transaction locks (max:65536) (int)
parm: commit_threads:Number of commit threads (int)

After we also examined the source code of the jfs driver (hooray for open source, see jfs_txnmgr.c), it became clear that we had found our bottleneck. We then tried another high-performance file system that we had wanted to test before, but never found the time for. This appeared to be the breakthrough we were looking for: basically, by changing a J into an X in our setup, we more than doubled the number of IOPS and got far beyond the 100K IOPS we were targeting.

Figure 4 shows the difference between using JFS with the CFQ scheduler versus XFS with the NOOP scheduler, both using the 2x6xRAID6 configuration (12 disks total). Results, as before, were obtained with the easy-co benchmark tool. Unfortunately, we changed our RAID controllers from a dual 1680 to a dual 1231 in the meantime and we didn’t redo the easy-co benchmark. It nevertheless clearly shows how the barrier has been broken:

fig 4. random IOps v.s. block-size
2x6xRAID6 (12 SSDs total)

iops-random-raid0-8.png

There are still some questions left unanswered. With this setup adding more disks to the array (from 2×6 to 2×8) did not improve the IOPS performance at all. Neither did changing the RAID level from 6 to 10, which theoretically should have given us another performance boost.

Alternative benchmark tools

In the previous study we only used easy-co’s bm-flash as our benchmark tool. For this study we verified our findings using another tool: xdd. Where bm-flash tests for a fixed amount of time (10 seconds per item), xdd tests for a given amount of data. This makes it easier to rule out caching effects by specifying an amount of data that surely doesn’t fit in the cache, thereby forcing the controller to actually go to disk. Since our setup uses 8GB of cache, we used 16GB of data for each test run, randomly read from a 64GB file using direct IO. Each test was repeated multiple times and the final result taken as the average over these runs (there was no notable difference between runs though).

Figure 5 shows the results for a 2x6xRAID6 setup, using 128 threads, LVM, array stripe size set to 128KB and the software stripe size set to 512KB. The following exceptions hold:

  • ssz = array stripe size 8KB, software stripe size 32KB
  • mdadm = mdadm used instead of lvm for software striping
  • raid10 = RAID10 used instead of RAID6

fig 5. random IOps v.s. block-size.
2x6xRAID6 (12 SSDs total)

xdd_2x6xraid6_jfs_vs_xfs.png

Note that this graph only shows part of the data that was shown in fig 4. Since XDD tests take a great deal longer than bm-flash tests and the interesting data is within the 1KB to 32KB range anyway, we decided to limit the XDD tests to that range.

When looking at figure 5 it becomes clear we’re seeing the exact same barrier as when using bm-flash. In this figure too, the graph shows that XFS breaks through this barrier. However, the actual numbers of reported IOPS are a good deal lower than what bm-flash reported. Here too, going from 2x6xRAID6 to 2x8xRAID10 did not improve performance, although there is a slight performance difference for the larger block sizes 16KB and 32KB.

Conclusion

Performance tuning remains a difficult thing. Having powerful hardware is one thing, but extracting the raw power from it can be challenging to say the least. In this study we have looked at alignment, IO schedulers and file systems. Of course it doesn’t stop there. There are many more file systems and most file systems have their own additional tuning parameters. Especially XFS is rich in that area, but we have yet to look into that.

Benchmarking correctly is a whole other story. It’s important to change only one single parameter between test runs meant for comparison, and to document all involved settings for each and every test run. Next to that, it’s important to realize what it is that you’re testing. Since bm-flash only runs each sub-test for 10 seconds, it’s likely that the cache is tested in addition to the actual disks. Also, in this study we only looked at random read performance, but write performance is important too. Taking write performance into account (pure writes, or doing a 90% read/10% write test) can paint a rather different picture of the performance characteristics.

Acknowledgement

Many thanks have to go to stevecs, who kindly provided lots of insight that helped us to tune our system. See this topic for additional details on that.


This benchmark was brought to you by Dennis Brouwer and Arjan Tijms of JDevelopment, an M4N team.

SSD performance improvements set LIVE

22 May 2009

In a previous blog entry (One DVD per second) we described how we built a new fast SSD-based database server and how we benchmarked it.

This week we put the new SSD DB server to the ultimate test, namely “Going Live”.

It turned out that our investments definitely paid off. Our main online application M4N has seen a massive speed improvement. We found an average performance increase of ~6 times for all queries. As a result, the initial average performance increase for the whole application is approximately a factor of 2. Keep in mind that the whole application depends on more than just the main DB; the performance of the Java application server and the network bandwidth play important roles as well.

The performance increase is most apparent when simply browsing through the different pages of the application. The application feels very snappy and even the more complex pages load nearly instantly. We have statistics on our average page speeds, so I included these in the pictures.

Picture one shows the improvements measured during testing:

Speed improvements new database server SSD Postgresql

Picture two shows the improvements in query execution speed:

Statistics speed improvement Postgresql database

Finally, during the night we execute a slew of maintenance queries. Normally, we see a number like this after the script that triggers these queries has finished executing:


real 348m6.451s
user 0m4.912s
sys 0m0.808s

The other morning, however, we saw this:


real 30m34.747s
user 0m2.752s
sys 0m1.008s

This amounts to a performance increase of over 11 times.
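For completeness, the factor follows directly from the two wall-clock times:

```shell
# 348m6s before vs. 30m35s after, in seconds.
before=$(( 348 * 60 + 6 ))
after=$(( 30 * 60 + 35 ))
echo $(( before / after ))   # prints 11, i.e. a speed-up of just over 11x
```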

For now we thus carefully conclude that using SSDs indeed improves performance by a large margin.

One DVD per second

20 February 2009

Abstract

For modern database systems the main bottleneck is usually IO. In order to speed up our database we conducted a study to find the fastest IO system currently available. In this article we describe our approach and findings. The result: with our new database server filled with SSDs we can transfer the amount of data equivalent to nearly one DVD per second, 3.3GB/sec.

Introduction

Our current SAS hard-disk based database server (6 disks, RAID10) was suffering from an access-time bottleneck: too much concurrent database access caused the server to slow down considerably. In a previous article, building-the-new-battleship-mtron, we promised test results using our new hardware.

We can now finally show you some of the obtained test results and give you a taste of what SSD-based file systems will bring. Of course, we ran into several firmware incompatibility issues using this emerging hardware, found motherboard PCI-E bottlenecks, discovered Linux kernel performance differences, software-specific benchmarking problems and strange file system hiccups, but most of this has been circumvented to produce the figures listed in this article.

Benchmarking was not a goal by itself: given the amount of hardware (which reflects a certain amount of hard cash), what would be the best performing, cost-effective and fail-safe setup using this hardware?

Well, we think we figured that out. Where to start… first let us explain why we chose this hardware setup in the first place. And for those who want to jump to conclusions, be our guest.

Hardware, the outfit

SSD: they come in several flavors, where the best taste is an SSD built using SLC memory; they last longer and generally perform better on writes. The downside: they are much more expensive. We already had some hands-on experience with MTRON SSDs in the past, so we chose the newer, better and faster MTRON 7500 Pro series. Of course there are Intel, MemoRight, OCZ and Transcend too, but at the time of writing MTRON was the best performer on paper. We ordered 12 MTRON MSP7535-032 (32GB) and 4 extra MTRON MSP7535-064 (64GB) SSDs.

RAID controller: that should be a PCI-E x8 board to handle all the IO. Unfortunately, at the time of writing no PCI-E x16 boards were available (and not many server motherboards, if any, support PCI-E x16 slots). Cache, and even more cache, should be available (only useful with a Battery Backup Unit applied). And, this may sound odd, but specifically for us the controller also needed to support SAS disks, since we wanted to re-deploy some of our SAS hard disks for good old times’ sake. A logical choice was to go for the ARC1680IX-12. Not only had we already gained some experience on a test server with its smaller brother, the ARC1680IX-8, but the recent firmware release 1.45 put it nearly on par with the native SATA controllers of the same brand. Problematic for us is the fact that all current high-performance RAID controllers have been designed with traditional hard disks in mind. This means that the RAID controller often becomes the bottleneck. To overcome this bottleneck we used not one, but two ARC1680IX-12s in parallel. The setup we used is depicted in figure A.

fig A. RAID05 dual controller setup
dual_raid_setup.png

Server: the server motherboard selected is the Super Micro X7DWN+, which can hold up to 128GB of RAM and two Quad Core Xeon CPUs. We equipped the board with two X5460 CPUs (at that time the highest-clocked Xeon, which comes in handy for CPU-bound calculations) and a mere 48GB of RAM. The server will act as a database server; the more RAM the better, so most of the data can be held in memory.

Casing: a couple of OS disks, legacy SAS hard disks and 12 SSDs need a shelter. We chose the Super Micro SC846, a 4U server case that nicely matches our two 1680IX-12 controllers and comes with a redundant power supply, something that is truly enterprise-worthy.

Firmware, the hurdles

First thing we did was install the controller cards, wire the backplane, insert an ordinary hard disk for the OS, install Debian on it and insert all SSDs in random order into the cabinet. Then the reboot, and… show time!

No, not yet: the event log of the Areca controllers showed many Time-Out errors. Doing the regular stuff like building RAID arrays seemed impossible, let alone creating a file system on the newly created block devices.

Research revealed that the hardware came with different firmware than we had used before. We found out that the 64GB SSDs were shipped with firmware 0.18R1H3, the same we used in our test server, while the 32GB SSDs came with 0.19R1, which was new to us. The drives with firmware 0.18R1H3 did indeed work properly. So we first asked our supplier to ship us the newer firmware 0.19R1H2 available at that time (sometimes newer is better). We flashed the disks to 0.19R1H2 but still no good; a lot of Time-Out errors appeared in the event log again. Then we figured that sometimes older is better and asked MTRON if it was possible to downgrade the firmware, and luckily it was. We were provided with the DOWN.EXE executable and the proper MTRON.MFI, and we flashed the SSDs back to 0.18R1H3. Bingo… all seemed to work fine now. We recently tried to upgrade some SSDs to firmware 0.20R1 but that didn’t work either.

Not only MTRON releases firmware on a regular basis; so does Areca. We therefore upgraded the ARC1680IX-12s to the 1.46 firmware released on 23-01-2009 and performed the whole ritual again, flashing the SSDs to the latest version 0.20R1. At first glance no Time-Out errors were noticed after the flash; however, building RAID arrays slowed down nearly ten times and a file system created on the resulting block device performed over 20 times worse than expected. We are back to 0.18R1H3 again. Currently Areca and MTRON are investigating in a joint effort what causes these problems. Some hurdles have been taken, but we are not finished yet; we will keep you posted. The latest news is that Areca was able to reproduce the problems, but located the cause within Intel’s IOP348 firmware. Of course Areca is unable to fix anything there, so it’s now Intel’s turn.

Motherboard, keeping the right PCI-E x8 lane

The motherboard comes with plenty of PCI-E x8 slots and we installed the RAID controllers in the slots that would give the best airflow in the casing. Since we have two controllers, we created identical arrays on both controllers (let’s call them “legs”). Tests showed a read performance difference of some 25% between those legs. Swapping the controller cards, and even the SSDs, showed that the cards were performing okay and the SSDs were not to blame either. When looking in the manual we discovered that the PCI-E x8 slots we tried were driven by different chipsets. Our board comes with an Intel 631xESB/632xESB IO controller chip (south bridge) and the 5400MCH (north bridge), where the 631xESB/632xESB IO controller chip performs less well. If you are in the situation of putting any piece of hardware in your server, be aware of, or better, test which slot to take.

Put in figures: using the 5400MCH we were able to do multi-threaded sequential and random READs with a maximum total bandwidth of 1750MB/s, while the 631xESB/632xESB IO controller stopped at 1400MB/s. WRITING was not affected by different slots, since it is bottlenecked by the disks themselves.

Benchmarking, the software as referee

This may be considered one of the most important aspects of deriving figures from a server: which benchmark software to use? Ideally that would be software for which a lot of results are already available for comparison. We evaluated several, like bonnie++, IOMeter, IOZone, bm-flash and Orion, all with their own pros and cons.

We eventually settled on bm-flash and IOMeter. bm-flash takes a fixed amount of time to finish and gives a quick impression of file system performance, while IOMeter, with a default workload, is widely used, so we have an abundance of figures for comparison. We dropped bonnie++ because under some conditions it will only show “++++” when no useful figures could be calculated. IOZone and Orion required a complete study to be interpreted.

In order to use IOMeter we had to drop Linux and install Windows Server 2008. Dynamo (the test client) running on Linux and the IOMeter GUI (manager) running on Windows are somehow incompatible and continuously caused crashes on our test server. For our escape to Windows Server 2008 we were stuck with NTFS.

For benchmarking Linux we chose the default JFS file system, because previous tests showed that (in our case) it used the least CPU overhead and performed better compared to ext3. One may get totally lost in file system tuning, but we decided it was not doable to tune too many parameters at a time. Note that the mount options were set to minimize writes to the disks:

rw,noatime,nodiratime
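In practice that means mounting the benchmark volume along these lines (the device name follows the lvcreate examples earlier; the mount point is illustrative):

```shell
# noatime/nodiratime suppress access-time updates, saving a write per read.
mount -o rw,noatime,nodiratime /dev/ssd/ssd-striped /benchmark
```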

Linux software striping, the odd and even lanes

In order to increase the bandwidth we wanted to make use of software striping, with the side effect of doubling the amount of RAM cache. We created two identical legs on the controllers, plugged them into equally performing lanes, and installed the mdadm and lvm2 packages for Debian (Debian Lenny – Linux 2.6.26-1-amd64). For both mdadm and lvm2 we configured striped block devices. Let the games begin…

Not to bother you with too many figures right now, we discovered an interesting finding: block devices created by mdadm performed about 10% better than those created by lvm2. The write bandwidth is quite similar, but the number of IOPS and the read bandwidth using mdadm are higher. When doing a survey on the internet we found that more people complained about this performance difference, and that file system alignment and read-ahead settings should be properly set. But even when all is properly configured, the difference remains. E.g. we can do a max of 65000 IOPS using mdadm and only 59000 IOPS with lvm2; we can read (sequential, single-threaded) a max of 1050MB/s using mdadm and only 980MB/s with lvm2. These measurements were obtained with reasonably fresh/new SSDs. Later on performance degraded and there was seemingly no way to get the original performance back. The mdadm/lvm2 performance difference can be attributed to the Linux kernel: mdadm uses the md kernel module while lvm2 uses the dm kernel module, and these are beyond our control. Due to system administrator support we will not use mdadm but will go with lvm2.

Note: we recently retested on Debian Lenny, now stable. The maximum read bandwidth (sequential, single threaded) for both mdadm and lvm2 stopped at 990MB/s. The difference in IOPS remained (65000 mdadm vs. 60000 lvm2) but was only noticeable for small block-sizes (512B – 1kB) and is therefore of minor importance to us.

Hiccups, the file system’s full stop

During our test runs we sometimes noticed that the performance of the system was below expectations. A typical test run using bm-flash would look like this [RAID0, 12 SSDs attached to one controller]:

test:/benchmark# ./bm-flash /ssd/test.txt 

Filling 4G before testing  ...   4096 MB done in 3 seconds (1365 MB/sec).

Read Tests:

Block |   1 thread    |  10 threads   |  40 threads   
 Size |  IOPS    BW   |  IOPS    BW   |  IOPS    BW   
      |               |               |               
 512B | 11307    5.5M | 63751   31.1M | 63312   30.9M 
   1K | 22720   22.1M | 63717   62.2M | 63086   61.6M 
   2K | 24788   48.4M | 63510  124.0M | 62734  122.5M 
   4K | 29158  113.8M | 63088  246.4M | 62361  243.5M 
   8K | 30170  235.7M | 61824  483.0M | 61089  477.2M 
  16K | 26110  407.9M | 61337  958.3M | 60991  952.9M 
  32K | 19936  623.0M | 51250 1601.5M | 50853 1589.1M 
  64K | 13589  849.3M | 27716 1732.2M | 27717 1732.3M 
 128K |  8273 1034.2M | 13877 1734.7M | 13881 1735.1M 
 256K |  4828 1207.2M |  6942 1735.5M |  6947 1736.8M 
 512K |  2563 1281.5M |  3472 1736.1M |  3476 1738.4M 
   1M |  1447 1447.0M |  1736 1736.6M |  1740 1740.2M 
   2M |   796 1592.3M |   866 1732.5M |   868 1737.1M 
   4M |   407 1630.0M |   430 1720.7M |   437 1748.3M 

Write Tests:

Block |   1 thread    |  10 threads   |  40 threads   
 Size |  IOPS    BW   |  IOPS    BW   |  IOPS    BW   
      |               |               |               
 512B | 28529   13.9M | 29048   14.1M | 28567   13.9M 
   1K | 27759   27.1M | 28213   27.5M | 28053   27.3M 
   2K | 27393   53.5M | 28036   54.7M | 27513   53.7M 
   4K | 26289  102.6M | 25278   98.7M | 26446  103.3M 
   8K | 23068  180.2M | 22138  172.9M | 21824  170.5M 
  16K | 17426  272.2M | 16820  262.8M | 17404  271.9M 
  32K | 10607  331.4M | 10988  343.3M | 11173  349.1M 
  64K |  6208  388.0M |  6894  430.9M |  6941  433.8M 
 128K |  3611  451.4M |  3895  486.9M |  4019  502.4M 
 256K |  1999  499.9M |  2116  529.0M |  2060  515.2M 
 512K |   834  417.4M |  1074  537.3M |  1024  512.2M 
   1M |   412  412.6M |   596  596.6M |   561  561.6M 
   2M |   280  561.3M |   240  480.0M |   207  414.7M 
   4M |   153  615.1M |   137  551.5M |   119  478.7M

However, with a different RAID configuration we noticed “gaps” in performance; the output would typically look like this [RAID5, 4 SSDs attached to one controller]:

test:/benchmark# ./bm-flash /ssd/test.txt 

Filling 4G before testing  ...   4096 MB done in 3 seconds (1365 MB/sec).

Read Tests:

Block |   1 thread    |  10 threads   |  40 threads
 Size |  IOPS    BW   |  IOPS    BW   |  IOPS    BW
      |               |               |
 512B |    24   12.2K | 28061   13.7M | 63593   31.0M
   1K | 22599   22.0M | 63882   62.3M | 63582   62.0M
   2K | 24644   48.1M | 63869  124.7M | 63275  123.5M
   4K | 29294  114.4M | 63519  248.1M | 62923  245.7M
   8K | 30188  235.8M | 62176  485.7M | 61474  480.2M
  16K | 26207  409.4M | 61542  961.5M | 61231  956.7M
  32K | 19944  623.2M | 51450 1607.8M | 50947 1592.1M
  64K | 13661  853.8M | 27717 1732.3M | 27718 1732.4M
 128K |  8356 1044.5M | 13878 1734.7M | 13881 1735.2M
 256K |  4756 1189.0M |  6942 1735.6M |  6947 1736.9M
 512K |  2563 1281.8M |  3473 1736.5M |  3476 1738.4M
   1M |  1383 1383.0M |  1737 1737.0M |  1739 1739.7M
   2M |   795 1590.1M |   852 1704.1M |   866 1733.0M
   4M |   408 1632.0M |   430 1720.3M |   436 1744.0M

Write Tests:

Block |   1 thread    |  10 threads   |  40 threads
 Size |  IOPS    BW   |  IOPS    BW   |  IOPS    BW
      |               |               |
 512B |  5642    2.7M |  7221    3.5M |  9223    4.5M
   1K |  7851    7.6M |  5218    5.0M |  5815    5.6M
   2K |  4142    8.0M |  4357    8.5M |  3976    7.7M
   4K |  3314   12.9M |  3379   13.2M |  2890   11.2M
   8K |  2251   17.5M |  2018   15.7M |  2500   19.5M
  16K |  1980   30.9M |  2659   41.5M |  3499   54.6M
  32K |  1303   40.7M |  1217   38.0M |  1342   41.9M
  64K |   240   15.0M |   144    9.0M |   126    7.9M
 128K |   499   62.3M |    96   12.0M |   110   13.8M
 256K |   309   77.2M |    13    3.2M
 512K |     6    3.4M
   1M |   154  154.0M |    10   10.5M
   2M |    80  160.3M |    99  198.7M |    99  199.1M
   4M |    42  170.7M |    54  218.7M |    57  230.0M

Notice the drop in performance that starts when writing the 64kB blocks.

Tests with IOMeter showed that under certain conditions the whole controller became unresponsive for several seconds; we even measured file system freezes in the order of 10 seconds (both reading and writing). We are still puzzling over what causes these freezes, but we assume that after heavy writing, with the controller’s cache completely filled up, the controller first flushes some cached data to disk and only then resumes working properly. We have asked Areca what the intended behavior should be and will keep you posted.
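
One way to watch for such stalls while a benchmark runs is to sample the block devices once per second; a sketch (the device name is an assumption):

```shell
# Sample extended device statistics every second while benchmarking;
# during a freeze the await and %util columns spike while r/s and w/s
# drop to zero for the duration of the stall.
iostat -x -d 1 sdb
```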

Figures, the benchmark results

Those interested in hard figures are probably looking at the graphs already. We deliberately will not show all the test results; we graphed them where possible to give an overview of SSD performance in different RAID configurations.

In short: we have 16 SSDs and two controllers that support a maximum of 12 SSDs each. We tested RAID0 and RAID5 but skipped RAID6 and RAID10 arrays, since write performance on these was rather poor. We repeatedly tested 1 to 12 SSDs in RAID0 and 3 to 12 SSDs in RAID5 configuration. For lvm2 striped block-devices (using the two controllers with software RAID0) we repeatedly tested 2×4 up to 2×8 SSDs per RAID0 and RAID5 configuration.

Emphasized are the 8kB block-size figures, because our database implementation is compiled to work with an 8kB block-size.

Sequential read, write and file copy versus number of SSDs and RAID configuration

dd if=/dev/zero of=/ssd/file.txt bs=8K count=5M [write 40GB file]
dd if=/ssd/file.txt of=/dev/zero [read 40GB file]
cp /ssd/file.txt /ssd/copy-of-file.txt [copy 40GB file]

fig 1. sequential read, write bandwidth [MB/s] v.s. SSDs and RAID configuration
8kB block-size

sequential-8k-1-thread.png

fig 2. 40GB file copy [seconds] v.s. SSDs and RAID configuration (lower is better)
40gb-file-copy.png
Conclusion: starting from 6 SSDs the controller’s read bandwidth saturates, while writing may benefit from additional SSDs. Using lvm2 software RAID0 and two controllers with an equal number of SSDs (8, 10, 12) always outperforms a single controller.

Random read, write bandwidth versus number of SSDs and RAID configuration

bm-flash /ssd/file.txt [random read, write and IOPS v.s block-size and threads]

fig 3. random read, write bandwidth in [MB/s] v.s. SSDs and RAID configuration
8kB block-size, 1 thread

8k-random-1-thread.png

fig 4. random read, write bandwidth in [MB/s] v.s. SSDs and RAID configuration
8kB block-size, 10 threads

8k-random-10-threads.png

fig 5. random read, write bandwidth in [MB/s] v.s. SSDs and RAID configuration
8kB block-size, 40 threads

8k-random-40-threads.png

Conclusion: bandwidth seems to be limited by the number of IO requests that can be spawned by a single thread (process). Scaling the number of threads up from 10 to 40 hardly influences total bandwidth. Using lvm2 software RAID0 incurs a slight performance penalty on random read.

Random read, write bandwidth and IOPS versus block size and RAID configuration

bm-flash /ssd/file.txt [random read, write and IOPS v.s block-size and threads]

Note: since we concluded that 10 or 40 threads does not make much of a difference, for readability we only show the 1 and 10 thread random read, write bandwidth and IOPS graphs.

RAID 0

fig 6. random read, write bandwidth in [MB/s] v.s. block-size and threads
RAID0 4 SSDs

bw-random-raid0-4.png

fig 7. random IOps v.s. block-size and threads
RAID0 4 SSDs

iops-random-raid0-4.png

fig 8. random read, write bandwidth in [MB/s] v.s. block-size and threads
RAID0 8 SSDs

bw-random-raid0-8.png

fig 9. random IOps v.s. block-size and threads
RAID0 8 SSDs

iops-random-raid0-8.png

fig 10. random read, write bandwidth in [MB/s] v.s. block-size and threads
RAID0 12 SSDs

bw-random-raid0-12.png

fig 11. random IOps v.s. block-size and threads
RAID0 12 SSDs

iops-random-raid0-12.png

RAID 5

fig 12. random read, write bandwidth in [MB/s] v.s. block-size and threads
RAID5 4 SSDs

bw-random-raid5-4.png

fig 13. random IOps v.s. block-size and threads
RAID5 4 SSDs

iops-random-raid5-4.png

fig 14. random read, write bandwidth in [MB/s] v.s. block-size and threads
RAID5 8 SSDs

bw-random-raid5-8.png

fig 15. random IOps v.s. block-size and threads
RAID5 8 SSDs

iops-random-raid5-8.png

fig 16. random read, write bandwidth in [MB/s] v.s. block-size and threads
RAID5 12 SSDs

bw-random-raid5-12.png

fig 17. random IOps v.s. block-size and threads
RAID5 12 SSDs

iops-random-raid5-12.png

RAID00 lvm2

fig 18. random read, write bandwidth in [MB/s] v.s. block-size and threads
2x4xRAID0 (8 SSDs total)

bw-random-raid00-8.png

fig 19. random IOps v.s. block-size and threads
2x4xRAID0 (8 SSDs total)

iops-random-raid00-8.png

fig 20. random read, write bandwidth in [MB/s] v.s. block-size and threads
2x6xRAID0 (12 SSDs total)

bw-random-raid00-12.png

fig 21. random IOps v.s. block-size and threads
2x6xRAID0 (12 SSDs total)

iops-random-raid00-12.png

fig 22. random read, write bandwidth in [MB/s] v.s. block-size and threads
2x8xRAID0 (16 SSDs total)

bw-random-raid00-16.png

fig 23. random IOps v.s. block-size and threads
2x8xRAID0 (16 SSDs total)

iops-random-raid00-16.png

As figures 18, 20 and 22 show, starting from a 256kB block-size we finally reach our claimed 3.3 GB/sec. Although reading with 40 threads generally didn’t influence the results much, in this case we consistently reached the mentioned 3.3 GB/sec at a smaller block-size (128kB) already when using 40 threads. Figure 22 shows the results we obtained with 16 SSDs over 2 RAID controllers. However, exactly the same read performance was observed when using 8, 10, 12 and 14 SSDs over 2 RAID controllers, meaning read performance saturates quite rapidly and increasing the number of SSDs beyond a certain threshold does not further improve read performance.

RAID05 lvm2

fig 24. random read, write bandwidth in [MB/s] v.s. block-size and threads
2x4xRAID5 (8 SSDs total)

bw-random-raid05-81.png

fig 25. random IOps v.s. block-size and threads
2x4xRAID5 (8 SSDs total)

iops-random-raid05-8.png

fig 26. random read, write bandwidth in [MB/s] v.s. block-size and threads
2x6xRAID5 (12 SSDs total)

bw-random-raid05-12.png

fig 27. random IOps v.s. block-size and threads
2x6xRAID5 (12 SSDs total)

iops-random-raid05-12.png

fig 28. random read, write bandwidth in [MB/s] v.s. block-size and threads
2x8xRAID5 (16 SSDs total)

bw-random-raid05-16.png

fig 29. random IOps v.s. block-size and threads
2x8xRAID5 (16 SSDs total)

iops-random-raid05-161.png

Figures 24 to 29 clearly show the awkward ‘gaps’ we encountered multiple times during testing. The gaps shown here are relatively minor (we saw much worse ones). As of now we have no explanation for this and attribute it to compatibility problems between the Mtron SATA disks and the Areca SATA implementation (via the Intel IOP348 SATA stack).

RAID 5 – closer look at write performance

fig 30. random write only bandwidth in [MB/s] v.s. block-size and number of disks

bw-random-raid5-multiple.png

Figure 30 zooms in on the write performance we measured for several RAID5 configurations, using 10 threads. The general trend is clearly that more disks improve write performance, a trend that was not so clearly visible for read performance. However, the graph in figure 30 is also cause for great concern. There are a couple of awkward gaps to be seen, and especially the 2×6 configuration is problematic: its first few and last few measurement points are exactly on par with the 2×8 configuration, but in the entire mid-section all performance is gone.

Intended production setup: lvm2 – using 2 x RAID5 – 5 + 1 hot spare SSDs

In the above tests we found that software striping outperformed any configuration on a single RAID controller. Therefore the intention is to configure a software RAID0 setup using lvm2 for the performance improvement. We have chosen a hardware RAID5 setup of 5 SSDs plus 1 hot spare per controller; this gives us much better performance than RAID6 while still offering redundancy. In case one SSD fails, the RAID will rebuild itself using the hot spare in about 15 minutes. During this window we are vulnerable to data loss, but that is a rather small and acceptable risk.
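
A rough sketch of how this production volume could be assembled on top of the two hardware RAID5 sets; the /dev/sdX names under which the Areca sets appear, and the volume names, are assumptions:

```shell
# Each Areca controller exposes its RAID5 set (5 SSDs + 1 hot spare) as one
# block device. Stripe the two sets together with lvm2: 2 stripes, 128kB
# stripe size, matching the hardware stripe size from the test specification.
pvcreate /dev/sdb /dev/sdc
vgcreate vg_db /dev/sdb /dev/sdc
lvcreate -i 2 -I 128 -l 100%FREE -n lv_db vg_db
mkfs.jfs -q /dev/vg_db/lv_db
mount -t jfs -o rw,noatime,nodiratime /dev/vg_db/lv_db /ssd
```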

The performance measured for this setup sits somewhere between that shown in figures 24, 26 and 25, 27; we will spare you another graph. What about sequential read and write performance? We measured this with a simple dd command for a big file (69GB), to rule out the mere 8GB of RAM cache, and a small file (4.3GB) that only hits the cache. In practical situations inserts and updates on the database will only hit the RAM cache, while reading from the database will be a mixture of reading from the 48GB of OS cache, the 8GB of on-board cache on the Arecas and finally the SSDs.

The big file (69GB):
dd if=/dev/zero of=/ssd/file.txt bs=512 x 2^[0..13] count=16k x 2^[13..0]
dd if=/ssd/file.txt of=/dev/zero bs=512 x 2^[0..13] 

The small file (4.3GB):
dd if=/dev/zero of=/ssd/file.txt bs=512 x 2^[0..13] count=1k x 2^[13..0]
dd if=/ssd/file.txt of=/dev/zero bs=512 x 2^[0..13] 
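
Written out as a runnable loop, the big-file sweep above would look something like this: the block size doubles from 512B to 4MB while the count halves each step, keeping the total file size constant at 64GiB (~69GB). A sketch only; the /ssd path is from our setup:

```shell
# Sweep dd block sizes 512B..4MB over a fixed ~69GB file on the lvm2 volume.
bs=512
count=$((16384 * 8192))   # 16k x 2^13 blocks of 512B = 64GiB (~69GB)
for i in $(seq 0 13); do
  dd if=/dev/zero of=/ssd/file.txt bs=$bs count=$count 2>/dev/null  # write pass
  dd if=/ssd/file.txt of=/dev/null bs=$bs 2>/dev/null               # read pass
  bs=$((bs * 2))
  count=$((count / 2))
done
```

For the small 4.3GB file the same loop applies with the initial count set to 1k × 2^13 blocks.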

fig 31. sequential read, write in [MB/s] v.s. block-size
bw-sustained-raid5-10.png

IOMeter

Coming soon

PostgreSQL performance

Coming soon

Test specification
Server; SuperMicro X7DWN+, 2 x X5460, 48GB RAM, 2 x ARC1680IX-12, 12 x MSP7535-032.
RAID controller ARC1680IX-12; 4GB DDR2 533MHz ECC, Firmware 1.46 23-01-2009, SAS 4.4.3.0, HDD Read Ahead = Auto, Cache = Enabled, Disk Write Cache Mode = Auto
RAID configuration; 2x RAID5 5 + 1 hot spare SSDs per controller – 128kB stripe size, Tagged Queuing = Enabled, Cache Mode =Write Back
SSD MTRON MSP 7535-032; Firmware 0.18R1H3
OS: Linux 2.6.26-1-amd64 [Debian Lenny]
Software RAID configuration: lvm2 with two RAID controllers – 128kB stripe size – READ AHEAD 8192

Conclusion

SSDs perform great, especially for database servers where lots of concurrent read and write operations are carried out. Tests show an overall performance improvement of ten times for our database server, but a general performance improvement figure cannot be given; it all depends on your file system usage.

When used wisely, SSDs are not much more expensive than traditional hard-disks (in the future they will probably be cheaper, because you need fewer SSDs to outperform hard-disks) and they consume less energy.

However, SSDs being a rather new technology, a lot of testing is required. It thus takes some time before one can actually go into production, and until we have solved, or at least understand, where the file system hiccups originate from, we will not go live with SSDs.


This benchmark was brought to you by Dennis Brouwer and Arjan Tijms of JDevelopment, an M4N team.

Building the new battleship Mtron

20 November 2008

A while back I stumbled upon the legendary article Battleship Mtron: the absurdly fast RAID array built with 9 Mtron SSDs on a blazingly fast Areca ARC-1231ML, sporting an amazing 800 MHz Intel IOP341.

It was the fastest thing on the planet. Period.

A year has passed since then. At M4N we have been experimenting with an SSD setup consisting of 4 Mtron 7000 SSDs on a development server. After some extensive benchmarking, it appeared that in nearly all situations the IO power of these beasts is far superior to that of traditional hard disks. A decision was made to order a bunch of SSDs to be placed in multiple servers. 12 of those arrived today, along with 2 Areca ARC-1680IX-12’s, each equipped with a whopping 4GB of cache and the fast 1.2 GHz IOP348.

Seeing all that hardware together, however, made us think. What if we… assemble it all in -1- massive storage array? 12 Mtron 7500’s on 2 Areca 1680’s (6 per controller), combining the power of the RAID sets of both Arecas into 1 single volume using software RAID. What would the performance of that be?

Stay tuned for our upcoming benchmark reports!

mtrons
12 Mtrons arrived at the office. Click for a larger image

Arjan
