Greetings. I started a new gig some two months back and have since enjoyed the company of geeks whose mandatory apparel includes laptop straps for mobile data center computing (not depicted) and the beauty that you see depicted to the right. The environment has thus far fostered much creativity and freedom of technical thought, to say the least. That said, I thought I’d share some insights from recent investigations into the performance characteristics of EMC’s DMX array, based on actual I/O tests conducted by yours truly.
While I was working on a proposal for a customer, EMC generously provided lab time on a DMX so that I could benchmark it and produce the numbers the customer was requesting.
Whenever we pitch storage solutions to customers, we’re typically looking at the IOPS requirements that the customer may have for their environment, whether for Oracle, Exchange, SQL, etc., and subsequently match that requirement with the proper spindle counts. Of course we still care about front-end bandwidth (2 Gb vs 4 Gb), cache, switches, etc. but typically it will come down to the spindle counts.
And so I began my test on the DMX with what I had, unfortunately with inadequate server resources, but you have to make do with what you’re given, right? The following is what I had to work with:
One (1) Dell 2850 with two (2) single-port 2 Gb Emulex LP9000 HBAs
One (1) Dell 1850 with one (1) single-port 2 Gb Emulex LP9000 HBA
One (1) DMX3-24 with 152x 300GB 10k drives, 96GB raw cache
One (1) McData DS-24M 2Gb switch
I ran a few tests, but I’ll focus on an easy one. Using Oracle’s ORION tool, I ran a simple test with 100% random read I/O, ranging from 8k to 1024k. With the next few graphs, I hope to show some comparisons between front-end and backend utilization that were interesting and informative.
Figure 1: FA Director IOPS
I had the 2850 mapped to FAs 9A0 and 10A0 and the 1850 mapped to FAs 7A0 and 8A0. Yes, I recognize that I didn’t use the nice rule of 17, but I originally had all four FAs mapped to the 2850 and later took two FAs away. Because of the way the zoning and LUN masking were done, I had to do it that way, and I was too lazy to reconfigure everything to satisfy the rule of 17.
The dual-HBA 2850 generally peaked at 11000 IOPS across both FAs (one HBA per FA), and the single-HBA 1850 generally peaked at 7000 IOPS across both FAs (one HBA for both FAs).
Figure 2: Average IO Size
You’ll see that the first half of the test comprised small I/Os (mostly 8k) while the second half of the test comprised large I/Os (peaking at roughly 750k). This, of course, makes sense when you look at Figure 1 above, and think of the general equation: IOPS * I/O Size = I/O Throughput. We were driving an aggregate 18000 IOPS using 8k I/Os in the first half of the test, while we were driving an aggregate of maybe 1000 IOPS using larger I/Os in the second half.
Figure 3: Disk IOPS
From a disk perspective, you’ll see that the disk IOPS matches right on with what the hosts were requesting at the front-end on the FAs.
Figure 4: FA Throughput
And finally, you’ll see that we’re getting about 145 MBps during the first half of the test while we’re getting about 535 MBps during the second half of the test.
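As a quick sanity check, the equation IOPS * I/O Size = I/O Throughput can be run against the figures quoted above. The following is only a rough sketch: it assumes "8k" and "750k" mean 8 KiB and 750 KiB, that MBps means decimal megabytes per second, and the function names are made up for illustration.

```python
KIB = 1024  # assumption: "8k" in the test means 8 KiB


def throughput_mb_per_s(iops, io_size_bytes):
    """Throughput in decimal MB/s, i.e. IOPS * I/O size."""
    return iops * io_size_bytes / 1_000_000


def iops_from_throughput(mb_per_s, io_size_bytes):
    """Invert the same equation to estimate IOPS from throughput."""
    return mb_per_s * 1_000_000 / io_size_bytes


# First half: roughly 18000 aggregate IOPS at 8k.
first_half_mb = throughput_mb_per_s(18_000, 8 * KIB)

# Second half: roughly 535 MBps, using the peak I/O size of about 750k.
second_half_iops = iops_from_throughput(535, 750 * KIB)

print(f"first half:  {first_half_mb:.1f} MB/s")
print(f"second half: {second_half_iops:.0f} IOPS at the peak I/O size")
```

The first-half result lands close to the 145 MBps read off the graph. The second-half estimate uses the peak I/O size, so it is a lower bound; the average I/O size was smaller than the peak, which would put the real IOPS figure somewhat higher.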
When designing for performance on a DMX, there are a few things that need to be considered when looking at utilization at the component level. Let’s look at the following utilization graph and interpret it together.
Figure 5: DMX Utilization Report
FA CPU: 50-55% during both halves
DA CPU: 20% during the first half, <5% during the second half
At first glance, it appears that the front-end is doing much more work than the backend. However, the numbers are a bit misleading as, if you recall, we are only using four (4) of the possible sixteen (16) FA CPUs. So you can see at least that the load is being spread across more CPUs in the backend.
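To put a rough number on that caveat, here is a small sketch of what the same front-end load would look like if it were spread evenly over all sixteen FA CPUs. It assumes the 50-55% figure is an average over just the four active FA CPUs (the report itself does not say so) and that the load would scale linearly; the function name is hypothetical.

```python
def spread_utilization(active_util_pct, active_cpus, total_cpus):
    """Utilization if the same work were spread evenly over total_cpus.

    Assumes perfectly linear scaling, which real directors will not quite match.
    """
    return active_util_pct * active_cpus / total_cpus


# Midpoint of the observed 50-55% FA CPU range, on 4 of 16 FA CPUs.
fa_if_spread = spread_utilization(52.5, active_cpus=4, total_cpus=16)
print(f"FA CPU if spread across all 16: about {fa_if_spread:.1f}%")
```

Under those assumptions the front-end figure drops to the low teens, which is a fairer number to set against the 20% seen on the DA side.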
FA Channel Bandwidth: 20% during the first half, 80% during the second half
DA Channel Bandwidth: <5% during the first half, 5% during the second half
Again the same consideration needs to be taken here with channel utilization. Given that we are using a fraction of the FA CPUs, it looks like we’re utilizing the FA channels at 90% while we’re utilizing the DA channels at 10%.
The most telling statistic is the following:
FA Board: 7% during the first half, 15% during the second half
DA Board: 10% during both halves
So overall board utilization is relatively comparable, though individual components were utilized differently. A few observations:
– The dual HBA configuration appeared to plateau at roughly 11000 IOPS at 8k I/O sizes, while the single HBA configuration appeared to plateau at roughly 7000 IOPS at 8k I/O sizes. Seems like that was the HBA limitation that I hit in this test.
– DA utilization dropped from 20% to <5% and disk utilization dropped from 90% to 40% from the first half to the second half of the test, showing that pre-fetching significantly improved utilization of backend resources.
– FA utilization remained relatively stable from the first half to the second half of the test, but FA channel utilization dramatically increased. This is interesting in that I would have thought that more, though smaller, I/Os would cause higher CPU utilization, but it appears that fewer but larger I/Os cause comparable CPU utilization.
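One way to look at the HBA plateau is to compare the per-HBA throughput with what a 2 Gb Fibre Channel link can carry. This is a back-of-the-envelope sketch, not a measurement: it assumes a nominal 200 MB/s of payload per 2 Gb port, 8 KiB I/Os, and that the second-half load of about 535 MBps was shared evenly by the three HBAs.

```python
FC_2G_MB_PER_S = 200  # assumption: nominal payload rate of one 2 Gb FC port
KIB = 1024


def per_hba_mb_per_s(total_iops, io_size_bytes, hba_count):
    """Average throughput per HBA in decimal MB/s."""
    return total_iops * io_size_bytes / hba_count / 1_000_000


# 8k plateaus from the first half of the test.
dual_hba = per_hba_mb_per_s(11_000, 8 * KIB, hba_count=2)   # Dell 2850
single_hba = per_hba_mb_per_s(7_000, 8 * KIB, hba_count=1)  # Dell 1850

# Large-I/O half: about 535 MBps over three HBAs (even split assumed).
large_io_hba = 535 / 3

for label, mb in [("2850, 8k", dual_hba),
                  ("1850, 8k", single_hba),
                  ("large I/O", large_io_hba)]:
    print(f"{label}: {mb:.1f} MB/s per HBA, "
          f"{100 * mb / FC_2G_MB_PER_S:.0f}% of a 2 Gb link")
```

If those assumptions hold, the 8k plateaus sit well under a third of link bandwidth, so whatever limit I hit there was more likely a per-I/O one (HBA, driver, or host) than the link itself, while the large I/Os in the second half were pushing close to link rate.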
I personally learned a lot about trying to benchmark a DMX. If I get another chance to do a benchmark in the future, there are some changes I’d make, a wishlist, if you will:
– Use a larger number of servers, at least 6-8 servers, each with dual HBA configurations
– Configure dual fabric connectivity across two separate switches
– Spread the load across all CPUs on the FA directors instead of just a subset
– Run the simple (100% random read I/O) and advanced tests (20% write, 80% random read I/O) again
– Dig deeper and check individual CPU utilization (FA and DA) using some of EMC’s tools
– Perform tests with TimeFinder to observe impact on the DA directors in the backend
But alas, that’s for another day.