Review: world’s first Supermicro 2026TT chassis

10 March 2010, by development

Just after our house style was redesigned in blue, we technicians go green. Our intention was to modernize our server park with 2.5″ chassis exclusively, and we had long been waiting for a 2U, 2.5″ chassis that would accommodate a lot of SSDs. We are already preparing a new database server (fitted with 24 x 2.5″ Intel X25-E SSDs) based on the Supermicro SC826 chassis, but our Xen cluster would love something similar too!

The wait is over: we were shipped the world’s very first Supermicro 2026TT-HIBXRF 2U server. While the blades more or less existed before, the special thing about this configuration is that all 6 SATA channels are connected to the backplane. This enables us to use all SATA ports for LVM setups, using SSDs for the sake of performance and power saving.

[image: front1]

[image: front2]

[image: swap bays and switching panel details]

[image: rear1]

[image: rear connection details]

The goodies we placed in each blade:

  • CPU: 2 x Xeon X5570 (Nehalem, 4 cores, 8 threads, 2.93 GHz, 95 W)
  • RAM: 6 x Kingston KVR1333D3D4R9S/4G
  • HDD: 1 x Seagate ST9500530NS 500GB SATA
  • SSD: 4 x INTEL X25-M Postville SSDSA2MH160G2C1 (@FW 02HD)

On the HDD we installed Debian Lenny with the Xen kernel (Xen 3.2-1-amd64); Dom0 and DomU run 2.6.26-2-xen-amd64 kernels. The virtual machines will run on LVM volumes created on the 4 SSDs. Since the Intel SSDs are based on MLC cells, we are taking a risk with potentially intensive writing; however, testing on our workstations showed that the life expectancy of this setup will be a couple of years. Manufacturers are planning longer-living MLC SSDs for the end of this year, so replacements will be on hand pretty soon.

[image: bios]

What about power usage of such a server? I measured it quickly with no optimizations in the Linux kernel and with the Hyper-Threading option switched off in the BIOS:

  • STANDBY: 30 Watt
  • 1 BLADE: 210 Watt
  • 2 BLADES: 347 Watt
  • 3 BLADES: 499 Watt
  • 4 BLADES: 647 Watt

This comes down to ~150 Watt per blade (idle), and ~50 Watt for 4 x HDD, 16 x SSD, and some case fans. Pretty cool, isn't it? What this setup will do under high load will be determined later; for now, the cooling conditions in our test room were far from optimal: with an ambient temperature of 27 degrees Celsius, the temperature within the chassis rose to 53 degrees. We have to postpone stress testing until the server is at its final destination, with much better cooling conditions.
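The per-blade estimate can be checked against the readings above. The snippet below is only a quick sketch (the variable names are ours, and the flat 150 W per blade is the estimate from the text, not a measurement):

```python
# Wall-power readings from the list above, in watts, keyed by the number
# of blades switched on (0 = chassis in standby).
readings = {0: 30, 1: 210, 2: 347, 3: 499, 4: 647}

# Extra power drawn by each additional blade after the first one.
increments = [readings[n] - readings[n - 1] for n in range(2, 5)]
average_increment = sum(increments) / len(increments)

# Assumption: a flat 150 W per idle blade, as estimated in the text.
assumed_per_blade = 150
remainder = readings[4] - 4 * assumed_per_blade

print(increments)                    # [137, 152, 148]
print(round(average_increment, 1))   # 145.7
print(remainder)                     # 47
```

The increments hover around 145 W, and assuming 150 W per blade leaves roughly 50 W for everything else.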

What about Xen, LVM and SSD performance? We are not done yet, but measurements in Dom0 showed pretty nice figures. Since a hardware RAID solution using these twin blades is more or less off limits, software RAID is the best alternative. From our experience, we know that the XFS file system performs best in benchmarks (with the IO scheduler set to deadline). After some sweet-spot measurements using lvm2 with the 4 SSDs striped (RAID 0), we figured out that the following settings are best:

pvcreate --metadatasize 511K /dev/sdb /dev/sdc /dev/sdd /dev/sde
vgcreate xenvg-ssd /dev/sdb /dev/sdc /dev/sdd /dev/sde
# 4 stripes (-i4), 256K stripe size (-I256), 40G volume named "benchmark"
lvcreate -i4 -I256 -L40G -n benchmark xenvg-ssd
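To picture what a 4-way stripe with a 256K stripe size means, here is a deliberately simplified sketch. It assumes chunks are handed out round-robin over the devices in the order they were listed, and it ignores LVM metadata and extent boundaries; the function name is made up for this illustration.

```python
KIB = 1024

def stripe_location(offset_bytes, stripes=4, stripe_size_kib=256):
    """Return (device_index, offset_on_device) for a logical byte offset.

    Simplified model of a striped volume: chunks of stripe_size_kib are
    handed out round-robin over the devices. LVM metadata and extent
    boundaries are ignored.
    """
    chunk = stripe_size_kib * KIB
    chunk_number, within_chunk = divmod(offset_bytes, chunk)
    row, device_index = divmod(chunk_number, stripes)
    return device_index, row * chunk + within_chunk

devices = ["/dev/sdb", "/dev/sdc", "/dev/sdd", "/dev/sde"]
for offset_kib in (0, 256, 512, 768, 1024):
    dev, dev_offset = stripe_location(offset_kib * KIB)
    print(f"{offset_kib:5d} KiB -> {devices[dev]} at {dev_offset // KIB} KiB")
```

In this model a large sequential request touches all four SSDs at once, while a small random request lands on just one of them.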

The figures below were not produced with IOzone, IOMeter or similar; we used our own tool set, which does the trick. For more information on this, please see: http://jdevelopment.nl/hardware/one-dvd-per-second/

bm-flash:

Filling 4G before testing  ...   4096 MB done in 12 seconds (341 MB/sec).

Read Tests:

Block |   1 thread    |  10 threads   |  40 threads
 Size |  IOPS    BW   |  IOPS    BW   |  IOPS    BW
      |               |               |
 512B |  8695    4.2M | 58401   28.5M |153774   75.0M
   1K |  7712    7.5M | 54920   53.6M |148026  144.5M
   2K |  6455   12.6M | 46069   89.9M |134606  262.9M
   4K |  4909   19.1M | 35301  137.8M |103674  404.9M
   8K |  4516   35.2M | 32108  250.8M | 72833  569.0M
  16K |  3954   61.7M | 27518  429.9M | 43003  671.9M
  32K |  3262  101.9M | 19297  603.0M | 22875  714.8M
  64K |  2376  148.5M | 11136  696.0M | 11750  734.3M
 128K |  1665  208.1M |  5880  735.1M |  5933  741.7M
 256K |  1001  250.4M |  2979  744.7M |  2973  743.4M
 512K |   841  420.7M |  1415  707.5M |  1422  711.2M
   1M |   533  533.5M |   619  619.0M |   621  621.0M
   2M |   280  560.0M |   307  615.5M |   309  619.3M
   4M |   143  574.3M |   153  614.7M |   151  606.3M

Write Tests:

Block |   1 thread    |  10 threads   |  40 threads
 Size |  IOPS    BW   |  IOPS    BW   |  IOPS    BW
      |               |               |
 512B | 11062    5.4M | 21375   10.4M | 26693   13.0M
   1K |  6834    6.6M | 15384   15.0M | 22303   21.7M
   2K |  6244   12.1M | 13582   26.5M | 23145   45.2M
   4K |  7473   29.1M | 18849   73.6M | 25007   97.6M
   8K |  7106   55.5M | 24629  192.4M | 31830  248.6M
  16K |  7254  113.3M | 18285  285.7M | 23884  373.1M
  32K |  4842  151.3M |  8619  269.3M | 11580  361.8M
  64K |  2525  157.8M |  4604  287.7M |  5943  371.4M
 128K |  1319  164.8M |  2377  297.2M |  3048  381.0M
 256K |   561  140.4M |  1244  311.0M |  1531  382.7M
 512K |   368  184.0M |   745  372.8M |   778  389.3M
   1M |   335  335.2M |   381  381.8M |   401  401.5M
   2M |   174  348.1M |   192  385.7M |   210  421.0M
   4M |    91  364.7M |   103  414.0M |   107  428.3M

xdd:

Random READ tests:

          |      1 Thread |    10 Threads |    40 Threads |
Blocksize |   IOPS   MB/s |   IOPS   MB/s |   IOPS   MB/s |
          |               |               |               |
      512 |  13639      6 | 120414     61 | 186044     95 |
     1024 |  14256     14 | 109734    112 | 181448    185 |
     2048 |  12669     25 |  95246    195 | 171345    350 |
     4096 |  10302     42 |  75704    310 | 132238    541 |
     8192 |   8591     70 |  55870    457 |  78980    647 |
    16384 |   7244    118 |  35797    586 |  43133    706 |
    32768 |   5786    189 |  21985    720 |  22711    744 |

Sequential READ tests:

          |      1 Thread |    10 Threads |    40 Threads |
Blocksize |   IOPS   MB/s |   IOPS   MB/s |   IOPS   MB/s |
          |               |               |               |
      512 |  35796     18 | 119992     61 | 178309     91 |
     1024 |  34838     35 | 113584    116 | 170864    174 |
     2048 |  28590     58 |  97803    200 | 173524    355 |
     4096 |  19967     81 |  72748    297 | 134078    549 |
     8192 |  14151    115 |  57131    468 |  79959    655 |
    16384 |   9276    151 |  38128    624 |  43480    712 |
    32768 |   4460    146 |  22309    731 |  22812    747 |

Random WRITE tests:

          |      1 Thread |    10 Threads |    40 Threads |
Blocksize |   IOPS   MB/s |   IOPS   MB/s |   IOPS   MB/s |
          |               |               |               |
      512 |  23271     11 |  33162     16 |  40870     20 |
     1024 |  16571     16 |  26695     27 |  37758     38 |
     2048 |  16747     34 |  25156     51 |  34664     70 |
     4096 |  14019     57 |  24817    101 |  29577    121 |
     8192 |  12817    104 |  25704    210 |  30310    248 |
    16384 |  11149    182 |  15612    255 |  23467    384 |
    32768 |   6613    216 |   8525    279 |  12281    402 |

Sequential WRITE tests:

          |      1 Thread |    10 Threads |    40 Threads |
Blocksize |   IOPS   MB/s |   IOPS   MB/s |   IOPS   MB/s |
          |               |               |               |
      512 |  29471     15 |  36580     18 |  41892     21 |
     1024 |  26631     27 |  35478     36 |  36696     37 |
     2048 |  23431     47 |  32128     65 |  39953     81 |
     4096 |  22747     93 |  33924    138 |  40566    166 |
     8192 |  19811    162 |  23773    194 |  38880    318 |
    16384 |  12436    203 |  16751    274 |  24396    399 |
    32768 |   7470    244 |   8978    294 |  13039    427 |

Random READ/WRITE [90/10] tests:

          |      1 Thread |    10 Threads |    40 Threads |
Blocksize |   IOPS   MB/s |   IOPS   MB/s |   IOPS   MB/s |
          |               |               |               |
      512 |  14961      7 |  57521     29 |  85189     43 |
     1024 |  12284     12 |  43737     44 |  73368     75 |
     2048 |   9762     19 |  33229     68 |  66863    136 |
     4096 |   7366     30 |  27530    112 |  58668    240 |
     8192 |   6298     51 |  25379    207 |  48998    401 |
    16384 |   5283     86 |  20828    341 |  29309    480 |
    32768 |   4019    131 |  15410    504 |  19318    633 |

Sequential READ/WRITE [90/10] tests:

          |      1 Thread |    10 Threads |    40 Threads |
Blocksize |   IOPS   MB/s |   IOPS   MB/s |   IOPS   MB/s |
          |               |               |               |
      512 |  14278      7 |  72588     37 |  88427     45 |
     1024 |  10767     11 |  49585     50 |  73693     75 |
     2048 |   9110     18 |  34068     69 |  72447    148 |
     4096 |   7592     31 |  27516    112 |  66647    272 |
     8192 |   6271     51 |  26221    214 |  53545    438 |
    16384 |   5512     90 |  22818    373 |  33087    542 |
    32768 |   4138    135 |  16400    537 |  20735    679 |
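The bandwidth columns in the tables above follow from the IOPS columns: bandwidth is IOPS times block size. The sketch below recomputes a few cells. Judging purely from the numbers (an inference on our side, not something either tool documents here), bm-flash seems to report binary megabytes and xdd decimal megabytes, both apparently rounded down.

```python
def bandwidth(iops, block_size_bytes, unit_bytes):
    """Throughput implied by an IOPS figure, expressed in the given unit."""
    return iops * block_size_bytes / unit_bytes

MIB = 1024 * 1024   # binary megabyte
MB = 1000 * 1000    # decimal megabyte

# bm-flash read test, 40 threads: 512B and 4K rows.
print(bandwidth(153774, 512, MIB))    # table reports 75.0M
print(bandwidth(103674, 4096, MIB))   # table reports 404.9M

# xdd random read test, 10 and 40 threads: 512-byte row.
print(bandwidth(120414, 512, MB))     # table reports 61
print(bandwidth(186044, 512, MB))     # table reports 95
```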

sequential:

# dd if=/dev/zero of=/bench/xdd/S1 bs=8K count=2M
2097152+0 records in
2097152+0 records out
17179869184 bytes (17 GB) copied, 34.8562 s, 493 MB/s
# dd of=/dev/zero if=/bench/xdd/S1 bs=8K
2097152+0 records in
2097152+0 records out
17179869184 bytes (17 GB) copied, 27.2719 s, 630 MB/s
#  time cp /bench/xdd/S1 /bench/xdd/S0
real	1m8.972s
user	0m0.468s
sys	0m13.049s
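dd prints its own throughput, but cp does not. A small sketch to put the three runs side by side, using only the byte count and the timings shown above (the helper name is ours):

```python
size_bytes = 17179869184          # size of /bench/xdd/S1 as reported by dd

def mb_per_s(num_bytes, seconds):
    """Decimal megabytes per second, the unit dd prints."""
    return num_bytes / seconds / 1_000_000

write_rate = mb_per_s(size_bytes, 34.8562)       # dd writing the file
read_rate = mb_per_s(size_bytes, 27.2719)        # dd reading the file
copy_rate = mb_per_s(size_bytes, 60 + 8.972)     # cp, 1m8.972s wall time

print(round(write_rate), round(read_rate), round(copy_rate))
```

Keep in mind that cp reads and writes on the same volume at the same time, so its figure is not directly comparable to the pure read and pure write runs.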

More to come....


M4N iPhone app on iPad

10 February 2010, by robin.eggenkamp

Recently, Apple announced the launch of the iPad. I downloaded the new SDK right away to try our iPhone application. Unfortunately, it crashed immediately, without a clear error. I decided to wait for the next version, as I was not the only one with this kind of problem. Last night, Apple released beta 2 of the SDK, so this morning I tried to run the app again. And now it works, thanks Apple!

[image: M4N iPhone app on iPad]


Java EE 6 – release imminent

7 October 2009, by arjan

Yesterday the umbrella JSR for Java EE 6 finally released a proposed final draft. This means the release of Java EE 6, a fairly large release with many tweaks and additions, is relatively imminent.

I’ve been tracking its progress for some time now and collected references to a number of articles about the major parts of Java EE 6 here: Java EE 6 progress page.

Indeed, the two most influential open source implementations (JBoss AS and GlassFish) are about to release new versions supporting Java EE 6. Naturally, GlassFish is more up to date with its support, but this time JBoss AS is hot on its heels.

Alexis of Sun wrote yesterday that “We’re fast approaching a final release of v3 with full Java EE 6 support indeed.”

Meanwhile, Jason Greene of JBoss announced about a month ago that “we can put out a 5.2 beta which includes some cool EE6 capabilities. The target release date is 10/15”, which would be around next week.

So, GlassFish is about to come with full support, while JBoss AS comes with some Java EE 6 support. Whatever your favorite implementation is, it seems that enterprise Java developers are in for some interesting times.


a hitchhikers guide to har2009.org galaxy

18 August 2009, by development

A short report of “a hitchhiker’s guide through hackers galaxy:
har2009.org”… Somewhere on the Dutch heath, far away from the inhabited
world, in the woods, lies a land called Har. A brief impression of the
atmosphere.

1) You simply have to bring a “printed” ticket, otherwise you will not
get into the country. (Showing up with your ticket on your iPhone? That
is sooo 2009, we live in wood-galaxy here, man.) But everyone is just
super friendly, and with a copy of your passport, a phone number and
the number on your ticket you get in as well.

2) Instead of ordinary Chipknip and debit cards, they only have plastic
tokens there, with which you can buy your beer and get food. There is
an exchange office in Har-Galaxy, though, so you can buy plastic tokens
with your debit card.

3) Mobile calls. Well... simply roaming on the local network at a
ridiculously high rate, as you do abroad, was not an option in Galaxy
land. Calling for free on your mobile was possible, provided you got
past all the security hurdles. As an outsider, I did not manage that.
So I just picked one of the 10 unsecured wifi networks, and my iPhone
was immediately shared with the rest of har2009. They are very kind
creatures, mind you: we shall share everything fairly.

4) The Harrers also network extensively. You do that by sitting next
to each other in the lounge with your laptops and checking in your IRC
chat client (a kind of MSN, but text based) whether you run into
anyone else.

5) You communicate through your T-shirt. What is cool? A T-shirt that
proves you can understand software code so complicated that only you
are intelligent enough to decipher it. Just an interesting slogan is
fine too.

6) A quiz, which is fun too, goes like this: you get an answer and
then you have to come up with the question yourself.

7) The presentations are interesting as well. You do not go to them,
even though they are only 500 metres away. You stay in your tent, put
the stream on the projector with the speakers on, and listen along
that way. OK, dropping by is fine too; admittedly, earthlings sometimes
do good things as well.

8) And what is so much fun about it? What is fun about Lego: putting
your construction together and then, when it is finished, looking at
it and thinking “so, that one works”. And then looking around to see
what else there is to do. The same goes for managing, on a campsite,
to program fully online and, by doing that together, exchanging
information and figuring out what else there is to hack (read:
program). Building it up together and helping out is the fun.

Har2009.nl was, for me, a short trip back in time, to when the internet
was being built by geeks, techies and software engineers who all
improve the world, where chance discoveries grow into worldwide
successes, but where people also raise the necessary issues, and where
the building blocks of our virtual internet world are made. Har2009
was a great experience.

Enclosed are some photos from the ordinary world, done the commercial
way, to prove that I was really there and that the above is really
true. These are no fairy tales.


@ har2009 and copyright

14 August 2009, by development

Hi,

At har2009 things are relaxed; we have a wonderfully fast uplink of more than 10 Gbit/s.
Downloading data is super fast, and naturally all data is legal. Naturally, I do not have a disk with me to put anything on.

We followed a nice discussion about what should happen with copyright.
It is a pity that our friends from mininova.org were not there.
We did try to get them on the panel, but then it would have been too much “us” against Brein.
Unfortunately, nobody asked what a copyright should then cost and how that is determined.
Apart from that, it was great fun to once again hear an interesting discussion that was reasonably well moderated.

The interesting thing is that affiliate marketing actually offers something of a solution to the copyright problem. You can see that there is clearly supply and demand. Film downloads may not become entirely free, but a better price to ask for them will probably emerge. Some films may be paid for far more through merchandising than through the price of a ticket to watch them. Just as with music, you have to go back to the model in which musicians are paid for performing and not for the records they market. For films this will probably mean that you can watch them via imdb for 1 euro apiece instead of 10, perhaps depending on the quality you want to see. The point is simply that, because you reach a much larger audience, you can still make revenue or a small profit.

People thought that Dell, by delivering directly to the customer, would beat all PC manufacturers, but they too use affiliate marketing, so there too there is a middleman again. An intermediary is always needed to find and sort out the right information for you as a customer. You are happy to pay for that. The question, of course, is whether you pay for it directly, or indirectly via advertising or products that you buy through that site.

A good example is what I see happening with newspapers. Newspapers complain about their revenue model but do all sorts of things other than sorting out the right information. I have noticed this especially in the area of global warming, but I will not go into that any further.
I notice that they do a great many different things except delivering and filtering good news. I expect a newspaper to deliver a good selection of news for me, and I am happy to pay something for that, but what they should not do is deliver things that are not about news, such as “cooking” recipes. Newspapers are less and less concerned with good news filtering and more and more with other things. They have to go back to their basics; if they keep doing that well, they will get readers again. Good information is written everywhere, but selecting it is precisely what is worth money. And if you do end up with advertising, there is value to be found there too. If the advertisements are also selected to suit that type of user, the user will have no problem with them either. Newspapers talk about their revenue model, but look at how much cost has been saved, especially in their field. All the technical innovations have made it much more effective and much easier for them to get hold of news. The point is that they have to pick out the right news.

We are now getting a talk about counting smurfs. So this must be fun.

So much for har2009.

Klaas Joosten


Come with us to HAR2009!

10 August 2009, by development

M4N, and therefore JDevelopment, is a sponsor of har2009.
Fancy staying at a campsite with nothing but techies?

Are you interested in system administration and the security of networks and technology?

We are raffling off a ticket for the HAR2009 (Hacking at Random) camp.

When: 13 – 16 August

Where:
De Paasheuvel
‘t Frusselt 30
8076 RE Vierhouten

What can you do?: Various workshops, see Workshop.

Would you like to come along? Send an e-mail to hr@m4n.nl before 11 August, 18:00, with your motivation for why you would like to go to HAR.
M4N will announce on 12 August, via the forum, who has won the ticket.


IntelliJ – The IDE I wanted to love

27 June 2009, by development

In the Java market we have a plethora of IDEs to choose from. The one I am most familiar with, and also see used most often, is Eclipse. Everyone knows Eclipse, of course. Eclipse is not without its faults, so it’s natural to look around every so often. Eclipse is my baseline. At this point we arrive at IntelliJ, another well-known IDE. I have heard IntelliJ described as the only IDE for Java developers, the IDE that will increase productivity 10-100 times. It’s a commercial product, which means you can have a reasonable expectation of stability, customer support and all the nifty tooling you would expect in Eclipse or any other IDE. Naturally you expect it to be better than Eclipse, or you wouldn’t pay for it. I proceeded to download, install and put IntelliJ to use.

A few words on environment. I run Debian Lenny on a Dell Precision T-3400.

IntelliJ is an easy install. Download, unpack, run (,????, profit!). It’s only 100MB compressed, which is less than the 150MB for Eclipse Classic (and a lot less than the 650MB for MyEclipse). You may need to check the JDK_HOME variable, but I’m guessing most people will have this set up already. Otherwise, what are you doing with a Java-specific IDE? You may be confronted with a dialog or two. To begin with you should be safe with the defaults. Odds are you only need a subset of the defaults to begin with anyway.

My first impression after the install is that IntelliJ is very pretty. Especially after I changed the theme. This all went really well. Everything was fast and seemed well thought out. Of course IntelliJ has its own key mapping, menu ordering and things of that nature. If you like you can enable Eclipse key mappings but if you want to get to know a program rather than get to work quickly I recommend going with what’s totally unfamiliar. Although I am of the opinion that ctrl-s means save in any language.

The Subversion plugin works like a charm. I am even convinced it’s faster than Subversive…at least in our local environment where Subversion has, at times, been a pain to work with. If for no other reason, this plugin makes me want to use IntelliJ. If my work confined itself to talking to the repository I would be done. Forever.

So here we have a fresh working copy of our project, complete with Eclipse project files. IntelliJ understands Eclipse project files or at least so it claims. An import or two later and I have an official IntelliJ project. As far as I’m concerned I just need to point this puppy to Tomcat and we’re good to go.

Now IntelliJ has a lot of spiffy UI thingies. You’ve got screens for configuring everything. You’ve got modules and facets. You’ve got raindrops on roses and warm woolen mittens. Confidently I configured the project to deploy to Tomcat. Of course it wouldn’t work at once, that would be too much to ask. I was convinced any problems I was sure to encounter would be easy to solve.

My first problem comes in the form of compiler errors. “But wait,” you say, “did you not do a fresh checkout? Are your colleagues so daft that they enter broken code?” That would be yes and not usually. IntelliJ seems less permissive than Eclipse in allowing certain constructs. Mind you, Eclipse has no compile errors on the same code…so it can’t really be a compiler error, since they use the same one. Ok, so it’s a precompile syntax validation check error or whatever. It’s not really a big deal.

Unfortunately I have no clue what IntelliJ is really doing. Somewhere there exists a Tomcat. Somewhere there exists at least one copy of this Tomcat. Somewhere you may or may not be deploying your project. Ok, I can actually find where IntelliJ hides its Tomcat copy. Why can’t it just use my Tomcat? I don’t know. Oh, you can, sort of, but then all of a sudden you have to start and stop Tomcat yourself. Not that I really care that IntelliJ uses a copy of Tomcat, except that this copy still uses the original’s directories. Sort of. I think. I’m not really sure.

A word on IntelliJ support. If you happen to write them, they answer quickly. Really quickly. Actually the reply is there almost before you press compose mail. I was seriously impressed. Unfortunately they couldn’t help me, and I didn’t want to rely heavily on support, which is mostly meant for paying customers anyway. I also don’t like using support lines in general. It feels like a flaw in the product. Which it is.

You may be asking why I didn’t simply look up the answers in the documentation. IntelliJ’s documentation is among the worst I have ever seen. It’s not worse than factually wrong information, but it’s not better than a blank piece of paper. Basically, if you have something labeled “Perform action” the documentation will read “Performs the action Action”. The documentation will never actually explain what is happening or go into the details of anything.

So now I have a project which may or may not be deploying somewhere. After some looking around I do figure out I have lots of classpath problems. Also, my ant builds aren’t being run. Ok, add library here, ant build before deploy…Some progress made but not enough. My project still won’t work.

To make a long story short: for about a week I’ve been playing with IntelliJ. I’ve installed, reinstalled, checked out, imported, etc, etc. I can’t get our project to work. This is a project which does work under Eclipse. So I blame IntelliJ. Where Eclipse always seems to have the thing you are looking for where you are looking for it, IntelliJ will make you search for it. Where Eclipse forces you to know what you’re doing and set up this and do that, IntelliJ has done it for you (mostly) and otherwise you can pretty much forget it.

To me IntelliJ is like a beautiful woman with a really annoying personality and a deaf ear. I really wanted to like it. Ask my colleagues, I am ready and willing to say that I hate something. I do it regularly and with passion. I do it to Eclipse all the time. Yet, despite all this, I really can’t hate IntelliJ. I would still really like to work with it. I’m convinced it could be really good. I can’t hate it and yet I can’t love it. For now it will remain the IDE I wanted to love.


100K+ IOPS on semi-commodity hardware

1 June 2009, by development

Abstract

A while ago we conducted a study to find the fastest IO subsystem that money can buy these days. The only restriction was that it should be built out of (semi) commodity hardware. That meant that extremely expensive solutions, where a sales representative needs to visit your office before you even get a price indication, were off limits. We reported our findings here: One DVD per second

The 65K IOPS barrier

When we examined the graphs resulting from our previous study, it was pretty obvious that we were hitting a bottleneck somewhere. No matter the amount of hardware that we threw at the system, all graphs quickly saturated at around a 65K IOPS limit. This is most apparent in the following graphs, which we repeat here from the previous study:

fig 1. random IOps v.s. block-size and threads
RAID0 8 SSDs

iops-random-raid0-8.png

fig 2. random IOps v.s. block-size and threads
RAID0 12 SSDs

iops-random-raid0-12.png

fig 3. random IOps v.s. block-size and threads
2×8xRAID0 (16 SSDs total)

iops-random-raid00-16.png

In these examples, maxing out the number of SSDs per controller (figure 2) or doubling the entire storage system and then striping these together (figure 3), does not yield any performance benefits with respect to the maximum random number of IOPS. We do see that the setup depicted in figure 3 gives us some performance benefits, but basically it only pushes two more data points (64k and 128k block sizes) towards the imaginary 60~65K barrier.

Alignment, stripe size and data stripe width

For SSD performance, correct alignment is of the utmost importance. A misaligned SSD will seriously underperform. In our previous study we did our best to find a correct alignment. Since we were using LVM to do the striping, we set the metadata size to 250k. This is actually a little trick we found, and apparently others found out about it too:

LVM likes to allocate 192k for its header information, and 192k is not a multiple of 128k. So if you are creating file systems as logical volumes, and you want those volume to be properly aligned you have to tell LVM that it should reserve slightly more space for its meta-data, so that the physical extents that it allocates for its logical volumes are properly aligned. Unfortunately, the way this is done is slightly baroque:

# pvcreate --metadatasize 250k /dev/sdb2
Physical volume “/dev/sdb2” successfully created

Why 250k and not 256k? I can’t tell you — sometimes the LVM tools aren’t terribly intuitive. However, you can test to make sure that physical extents start at the proper offset by using:

See: Aligning filesystems to an SSD’s erase block size
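The arithmetic behind the quote is easy to check. The sketch below assumes a 128k erase block, as in the quote, and assumes that asking for 250k makes the first physical extent start at 256k; that second number is our assumption here, so verify the real offset on your own system (LVM's pvs can report it through its pe_start field).

```python
KIB = 1024

def is_aligned(offset_kib, erase_block_kib=128):
    """True if the offset falls on an erase-block boundary."""
    return (offset_kib * KIB) % (erase_block_kib * KIB) == 0

# 192k: the default LVM header size mentioned in the quote.
# 256k: assumed start of the first physical extent after asking for 250k
#       (an assumption here; check the real value on your own system).
for start_kib in (192, 256):
    print(start_kib, is_aligned(start_kib))
```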

When digging some more into this, we learned about the concept “data stripe width”, which is defined as:

data width = array stripe size * number of datadisks in your array

E.g. for raid 6 it’s array stripe size * (number of disks in array -2) and for raid 5 it’s array stripe size * (number of disks in array -1). This is an important number when you are striping multiple smaller arrays (sub-arrays, or what we called ‘legs’) into a larger array. It gives you the amount of data that together occupies exactly 1 array stripe of each device that makes up the sub-array. It’s this number that you should work with when trying to align LVM.
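As a worked example of this definition, the following sketch computes the data stripe width for a few array layouts. The function and table names are made up for the illustration; only RAID 0, 5 and 6 are covered.

```python
# Number of parity (non-data) disks per RAID level, as used in the text.
PARITY_DISKS = {"raid0": 0, "raid5": 1, "raid6": 2}

def data_stripe_width_kib(stripe_size_kib, disks, level):
    """Array stripe size times the number of data disks in the array."""
    data_disks = disks - PARITY_DISKS[level]
    if data_disks < 1:
        raise ValueError("array has no data disks")
    return stripe_size_kib * data_disks

# One leg of 6 disks in RAID 6 leaves 4 data disks.
print(data_stripe_width_kib(128, 6, "raid6"))  # 512
print(data_stripe_width_kib(8, 6, "raid6"))    # 32
print(data_stripe_width_kib(128, 6, "raid5"))  # 640
```

The first two results are the 512KiB and 32KiB widths that come back in the alignment discussion below.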

Stevecs explains this rather well:

pvcreate --metadatasize X /dev/sda /dev/sdb
# (X is 511 or whatever it takes to start at 512KiB with padding)

vgcreate ssd -s Y /dev/sda /dev/sdb
# Y here is your physical extent size which you probably want to be a multiple of your data stripe width
# default is 4MiB and that’s fine as our data stripe widths here divide evenly into that for 8K (32KiB) or 128K (512KiB)

lvcreate -i2 -IZ -L447G -n ssd-striped ssd
# Z here should be a multiple of your data stripe width (32KiB or 512KiB)

See: xtremesystems
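To make the X, Y and Z placeholders concrete, here is a small sketch that fills them in and checks the two "multiple of the data stripe width" rules from the quote. It only builds the command strings and never runs them; the function name and its defaults (511k metadata, 4MiB extents) are our own choices for illustration, not part of the quoted advice.

```python
def lvm_commands(devices, vg, lv, size, width_kib,
                 metadata_kib=511, extent_kib=4096, stripe_kib=None):
    """Build (not run) the three commands from the quote above."""
    stripe_kib = stripe_kib or width_kib
    if extent_kib % width_kib or stripe_kib % width_kib:
        raise ValueError("extent and stripe size must be multiples "
                         "of the data stripe width")
    devs = " ".join(devices)
    return [
        f"pvcreate --metadatasize {metadata_kib}k {devs}",
        f"vgcreate {vg} -s {extent_kib}k {devs}",
        f"lvcreate -i{len(devices)} -I{stripe_kib} -L{size} -n {lv} {vg}",
    ]

for line in lvm_commands(["/dev/sda", "/dev/sdb"], "ssd", "ssd-striped",
                         "447G", width_kib=512):
    print(line)
```

Printing rather than executing is deliberate: these commands are destructive, so it is worth reading them before pasting them into a root shell.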

Applying this theory did help in some cases, but the 65k barrier hadn’t moved an inch.

The Linux IO scheduler

Something we overlooked in the previous study is the fact that Linux has a configurable IO scheduler. Currently available are the following four:

  1. Completely Fair Queuing Scheduler (CFQ)
  2. Anticipatory IO Scheduler (AS)
  3. Deadline Scheduler
  4. No-op Scheduler

For an in-depth discussion of each scheduler, see: 3. Schedulers / Elevators

The CFQ and AS schedulers in particular are designed to order IO in such a way that seek latencies are minimized. The thing is, of course, that SSDs don’t have any notable seek delay, and certainly none that depends on where one block sits relative to another. It thus doesn’t matter at all whether you e.g. fetch blocks 1, 10 and 20 sequentially or totally at random. Setting the scheduler to noop, which doesn’t try to be smart and basically does nothing at all, improved overall performance again. However, there was STILL a clear saturation point in our performance graphs. The barrier had moved up a little, but it was still very much there.
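On Linux the scheduler of a block device is exposed through sysfs, in the file /sys/block/<device>/queue/scheduler, with the active one shown in square brackets. A minimal sketch for reading it follows; the example line is illustrative input rather than output captured from our machines, and the function names are ours.

```python
def parse_schedulers(text):
    """Split a sysfs scheduler line into (available, active)."""
    names = text.split()
    active = next((n[1:-1] for n in names if n.startswith("[")), None)
    available = [n.strip("[]") for n in names]
    return available, active

def read_schedulers(device):
    """Read the scheduler line for a block device such as 'sdb'."""
    with open(f"/sys/block/{device}/queue/scheduler") as handle:
        return parse_schedulers(handle.read())

# Example line as a 2.6-era kernel would present it (illustrative input).
available, active = parse_schedulers("noop anticipatory deadline [cfq]\n")
print(available)  # ['noop', 'anticipatory', 'deadline', 'cfq']
print(active)     # cfq
```

Switching is done by writing the desired name to that same file as root, and it has to be repeated for every device in the array.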

The file-system

At this point we were frantically wading through the C source code of the Areca driver, monitoring the interrupts and basically grasping at every straw within our reach. It was then that my co-worker Dennis noticed the following when examining the jfs module:

modinfo jfs
filename: /lib/modules/2.6.26-1-amd64/kernel/fs/jfs/jfs.ko
license: GPL
author: Steve Best/Dave Kleikamp/Barry Arndt, IBM
description: The Journaled Filesystem (JFS)
depends: nls_base
vermagic: 2.6.26-1-amd64 SMP mod_unload modversions
parm: nTxBlock:Number of transaction blocks (max:65536) (int)
parm: nTxLock:Number of transaction locks (max:65536) (int)
parm: commit_threads:Number of commit threads (int)

After we also examined the source code of the jfs driver (hooray for open source, see jfs_txnmgr.c), it became clear that we had found our bottleneck: note the max:65536 limits on transaction blocks and locks above. We thus tried another high-performance file system that we had wanted to test before, but never found the time for. This turned out to be the breakthrough we were looking for. Basically, by changing a J into an X in our setup, we more than doubled the number of IOPS and got far more than the 100K IOPS we were targeting.

Figure 4 shows the difference between using JFS with the CFQ scheduler and using XFS with the NOOP scheduler, both on the 2×6xRAID6 configuration (12 disks total). As before, the results were obtained with the easy-co benchmark tool. Unfortunately, we changed our RAID controllers from a dual 1680 to a dual 1231 in the meantime and we didn’t redo the easy-co benchmark. It nevertheless clearly shows how the barrier has been broken:

fig 4. random IOps v.s. block-size
2×6xRAID0 (12 SSDs total)

iops-random-raid0-8.png

There are still some questions left unanswered. With this setup adding more disks to the array (from 2×6 to 2×8) did not improve the IOPS performance at all. Neither did changing the RAID level from 6 to 10, which theoretically should have given us another performance boost.

Alternative benchmark tools

In the previous study we only used easy-co’s bm-flash as our benchmark tool. For this study we verified our findings using another tool: xdd. Where bm-flash tests for a fixed amount of time (10 seconds per item), xdd tests for a given amount of data. This makes it easier to rule out caching effects by specifying an amount of data that surely doesn’t fit in the cache, thereby forcing the controller to actually go to disk. Since our setup uses 8GB of cache, we used 16GB of data for each test run, randomly read from a 64GB file using direct IO. Each test was repeated multiple times and the final result taken as the average over these runs (there was no notable difference between the runs though).

Figure 5 shows the results for a 2×6xRAID6 setup, using 128 threads, LVM, array stripe size set to 128KB and the software stripe size set to 512KB. The following exceptions hold:

  • ssz = array stripe size 8KB, software stripe size 32KB
  • mdamd = mdadm used instead of lvm for software striping
  • raid10 = RAID10 used instead of RAID6

fig 5. random IOPS vs. block size
2×6xRAID6 (12 SSDs total)

xdd_2x6xraid6_jfs_vs_xfs.png

Note that this graph only shows part of the data that was shown in fig 4. Since XDD tests take a great deal longer than bm-flash tests and the interesting data is within the 1KB to 32KB range anyway, we decided to limit the XDD tests to that range.

When looking at figure 5 it becomes clear we’re seeing the exact same barrier as when using bm-flash. In this figure too, the graph shows that XFS breaks through this barrier. However, the actual numbers of reported IOPS are a good deal lower than what bm-flash reported. Here too, going from 2×6xRAID6 to 2×8xRAID10 did not improve performance, although there is a slight performance difference for the larger block sizes, 16KB and 32KB.

Conclusion

Performance tuning remains a difficult thing. Having powerful hardware is one thing, but extracting the raw power from it can be challenging, to say the least. In this study we have looked at alignment, IO schedulers and file systems. Of course it doesn’t stop there. There are many more file systems, and most file systems have their own additional tuning parameters. XFS is especially rich in that area, but we have yet to look into that.

Benchmarking correctly is a whole other story. It’s important to change only a single parameter between test runs meant for comparison, and to document all settings involved for each and every test run. Next to that, it’s important to realize what it is that you’re testing. Since bm-flash only runs each sub-test for 10 seconds, it’s likely that the cache is tested in addition to the actual disks. Also, in this study we only looked at random read performance, but write performance is important too. Taking write performance into account (pure writes, or doing a 90% read/10% write test) can paint a rather different picture of the performance characteristics.
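
The one-parameter rule can be checked mechanically if every run's settings are written down. The following is only a sketch: the dictionary keys and helper names are hypothetical, and the two example runs are filled in from the description of figure 4.

```python
def changed_parameters(run_a, run_b):
    """Return the sorted names of all settings that differ between two runs."""
    keys = set(run_a) | set(run_b)
    return sorted(k for k in keys if run_a.get(k) != run_b.get(k))

def is_single_parameter_change(run_a, run_b):
    """True only if exactly one setting differs between the two runs."""
    return len(changed_parameters(run_a, run_b)) == 1

# Settings as described for figure 4 (key names are our own choice).
jfs_run = {"filesystem": "jfs", "scheduler": "cfq", "raid_level": 6, "disks": 12}
xfs_run = {"filesystem": "xfs", "scheduler": "noop", "raid_level": 6, "disks": 12}

print(changed_parameters(jfs_run, xfs_run))          # ['filesystem', 'scheduler']
print(is_single_parameter_change(jfs_run, xfs_run))  # False
```

A check like this makes it visible when a comparison mixes several changes at once, so the effect of each change can be measured in a separate run.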

Acknowledgement

Many thanks have to go to stevecs, who kindly provided lots of insight that helped us to tune our system. See this topic for additional details on that.


This benchmark was brought to you by Dennis Brouwer and Arjan Tijms of JDevelopment, an M4N team.


SSD performance improvements set LIVE

22 May 2009, by development

In a previous blog entry (One DVD per second) we described how we built a new, fast SSD-based database server and how we benchmarked it.

This week we put the new SSD DB server to the ultimate test, namely “Going Live”.

It turned out that our investments definitely paid off. Our main online application, M4N, has seen a massive speed improvement. We found an average performance increase of ~6 times for all queries. As a result, the initial average performance increase for the whole application is approximately a factor of 2. Keep in mind that the whole application depends on more than just the main DB. Of course, the performance of the Java Application Server and the network bandwidth play important roles as well.
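
The gap between ~6 times faster queries and a factor of 2 for the whole application is what Amdahl's law predicts when only part of the total time is spent in the database. The sketch below is an illustration under a strong assumption, namely that page time splits cleanly into a database part that speeds up uniformly and a remainder that does not change. The implied fraction is an estimate derived from the two rounded figures above, not a measurement.

```python
def overall_speedup(db_fraction, db_speedup):
    """Amdahl's law: total speedup when only db_fraction of the time gets faster."""
    return 1.0 / ((1.0 - db_fraction) + db_fraction / db_speedup)

def implied_db_fraction(overall, db_speedup):
    """Solve overall = 1 / ((1 - f) + f / s) for the fraction f."""
    return (1.0 - 1.0 / overall) / (1.0 - 1.0 / db_speedup)

# Rounded figures from the text: queries ~6x faster, application ~2x faster.
fraction = implied_db_fraction(2.0, 6.0)
print(round(fraction, 2))                       # 0.6
print(round(overall_speedup(fraction, 6.0), 2)) # 2.0
```

Under these assumptions roughly 60% of the original page time would have been database time, which is consistent with the application server and network accounting for the rest.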

The performance increase is most apparent when simply browsing through the different pages of the application. The whole experience now feels very snappy and even the more complex pages load nearly instantly. We keep statistics on our average page speeds, so I have included these in the pictures.

Picture one shows the improvements with testing:

Speed improvements new database server SSD Postgresql

Picture two shows the improvements in query execution speed:

Statistics speed improvement Postgresql database

Finally, during the night we execute a slew of maintenance queries. Normally, we see a number like this after the script that triggers these queries has finished executing:


real 348m6.451s
user 0m4.912s
sys 0m0.808s

The other morning however we saw this:


real 30m34.747s
user 0m2.752s
sys 0m1.008s

This amounts to a performance increase of over 11 times.
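
For reference, the factor can be recomputed directly from the two `time` outputs above. This is a small sketch; the helper name is our own.

```python
import re

def parse_real_seconds(time_output):
    """Extract the 'real' wall-clock time from `time` output, in seconds."""
    match = re.search(r"real\s+(\d+)m([\d.]+)s", time_output)
    if match is None:
        raise ValueError("no 'real' line found")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)

before = parse_real_seconds("real 348m6.451s\nuser 0m4.912s\nsys 0m0.808s")
after = parse_real_seconds("real 30m34.747s\nuser 0m2.752s\nsys 0m1.008s")

print(round(before / after, 1))  # 11.4
```

Only the `real` line matters here: the user and sys times measure the script itself, while nearly all of the waiting happens in the database.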

For now we thus carefully conclude that using SSDs indeed improves performance by a large margin.


Postgresql the most advanced opensource Database and full text search.

23 April 2009, by development

We work with Postgresql and are very happy with its performance and with the fact that this DB exists. We are also very happy with the new full text search.
But what happened when we searched the Postgresql website for documentation?
We searched on:
http://search.postgresql.org/search?q=All

Our result for the search term “ALL” was:

“Your search for All returned no hits”.

When searching for LIMIT ALL or “LIMIT ALL” (which is what I was actually looking for), the “ALL” was ignored and only hits containing “LIMIT” were returned.

I tried some more words, like “This”, “because” and “that”. None of them gave any result.

Look at the slogan on the right side of the site:
“The world’s most advanced open source database”?
This gives us some really nice thoughts! ;-)
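
A likely explanation, and it is only an assumption since we did not look at how the site search is implemented, is stop-word filtering: full text search engines commonly drop very frequent words before matching. The toy sketch below uses a tiny, hypothetical stop-word list to show how that would produce exactly the behaviour we saw.

```python
# Hypothetical, tiny stop-word list for illustration only.
STOP_WORDS = {"all", "this", "because", "that"}

def effective_terms(query):
    """Lowercase the query, strip double quotes, and drop stop words."""
    words = query.lower().replace('"', " ").split()
    return [word for word in words if word not in STOP_WORDS]

print(effective_terms("All"))          # []
print(effective_terms('"LIMIT ALL"'))  # ['limit']
print(effective_terms("because"))      # []
```

A query that consists only of stop words leaves nothing to search for, and a query such as LIMIT ALL effectively becomes a search for LIMIT alone.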

searchall Postgresql the most advanced opensource Database and full text search.

