Aspects of Remote Replication Performance in Symmetrix DMX and Symmetrix V-Max Series with Enginuity

May 21, 2009 at 1:58 pm (Information)

V-Max Baseline

  • FC Performance improved by up to 36%
    • Write latency can be as low as 1 ms for small block size and zero distance
    • Single round trip continues to deliver significant savings over long distances
    • Maximal FC IO rate is 5000 IOPS
    • Maximal throughput up to 260 MB/s per RA
  • GigE Performance
    • Write latency can be as low as 1.25ms
    • Maximal IO is 4000 IOPS
    • Maximal throughput is about 90 MB/s per link (1 Gb link limit)
  • Note that these are maximum utilization limits. Don’t plan to reach director limits as part of the design.

SRDF Copy over FC Improved by 36%

  • Up to 520 MB/s from 380 MB/s with two FC links

FC Protocol

  • Two-phase FCP protocol for every write
    • Send command, ready receive
    • Send data, status acknowledge
  • This is fine in short distance but becomes prohibitive in longer distances. This could make FC worse than GigE if nothing is done about it.
  • Single round trip was introduced in DMX3/4 in 5772
    • Send command, send data unsolicited
    • Target buffers the data, send status
  • On average can save 1ms per 100km of distance
  • New to Symmetrix V-Max, improved buffering enables Single Round Trip for any I/O size (only done on small I/O for DMX)
  • This mode is configured as on and will start working in this mode when the latency reaches a certain threshold.
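
The 1ms per 100km figure above can be sanity-checked with a rough latency model. This is only a sketch: the 5 microseconds per km fibre delay is a general approximation rather than a number from the session, and `write_latency_ms` and its parameters are names made up for illustration.

```python
# Assumption: one-way propagation delay in fibre is about 5 microseconds per km
# (light at roughly 200,000 km/s in glass). Not a figure from the session.
FIBRE_DELAY_MS_PER_KM = 0.005


def write_latency_ms(distance_km, round_trips, base_ms=1.0):
    """Estimate remote write latency as fixed overhead plus link propagation.

    base_ms defaults to the ~1 ms zero-distance FC write latency quoted above.
    """
    round_trip_ms = 2 * distance_km * FIBRE_DELAY_MS_PER_KM
    return base_ms + round_trips * round_trip_ms


for km in (0, 100, 500, 1000):
    two_phase = write_latency_ms(km, round_trips=2)  # command/ready, then data/status
    single = write_latency_ms(km, round_trips=1)     # command + unsolicited data, status
    print(f"{km:>5} km  two-phase {two_phase:5.1f} ms  single round trip {single:5.1f} ms")
```

Dropping one round trip removes one full out-and-back propagation delay per write, which is where the saving of about 1ms per 100km comes from.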

Scalability

  • Support for 250 SRDF groups per array
  • Up to 64 groups on a single port
  • Support for up to 32 SRDF ports in a fully loaded array
  • Concurrent writes in OS can run in parallel on all SRDF ports
  • QoS can set priority among competing SRDF/S volumes
  • DCP (dynamic cache partitioning) and SRDF/A cache limits can be used to set priorities among competing SRDF/A groups
  • SRDF performance scales linearly. For example, four RA directors can do about four times as much as one RA director in both DMX and V-Max
  • Same response time between 1, 32, 64, 250 groups (4KB write cache hit, 8 FC links, SRDF/S)
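
Since SRDF throughput scales roughly linearly with RA directors, a first-pass director count can be estimated from the per-RA ceiling. A minimal sketch, assuming the 260 MB/s FC figure from the baseline above; the 50% planning utilization is my own placeholder (the notes only say not to design to the director limit), and `ra_directors_needed` is a hypothetical helper.

```python
import math

FC_RA_MAX_MBPS = 260  # per-RA FC throughput ceiling quoted in the baseline


def ra_directors_needed(required_mbps, target_utilization=0.5):
    """Number of RA directors so each runs at or below target_utilization.

    target_utilization is an assumed planning value, not an EMC guideline.
    """
    usable_per_ra = FC_RA_MAX_MBPS * target_utilization
    return math.ceil(required_mbps / usable_per_ra)


print(ra_directors_needed(400))        # planned at 50% of the per-RA limit
print(ra_directors_needed(520, 1.0))   # running flat out, matches the two-link copy figure
```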

SRDF/A Architecture

  • Write folding: repeated writes to the same data within a cycle are sent only once
  • Response times are at 0.3ms. V-Max will allow you to go farther in terms of total IOPS (50k IOPS vs 42k IOPS) before response times begin to rapidly increase
  • Small impact even with smaller cycle times (30 seconds, 5 seconds, 2 seconds), but this is dependent on link bandwidth
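
Write folding can be pictured as keeping only the latest write per track within a cycle. This is a toy sketch of the idea, not how Enginuity implements it; `fold_cycle` and the track numbers are invented for illustration.

```python
def fold_cycle(writes):
    """Collapse one cycle's writes so each track is transmitted only once.

    writes: iterable of (track, data) pairs in arrival order.
    Returns a dict mapping track -> last data written in the cycle.
    """
    folded = {}
    for track, data in writes:
        folded[track] = data  # a later write to the same track replaces the earlier one
    return folded


cycle = [(10, "a"), (11, "b"), (10, "c"), (10, "d"), (12, "e")]
sent = fold_cycle(cycle)
print(len(cycle), "host writes ->", len(sent), "tracks sent over the link")
```

The longer the cycle, the more chances there are for repeat writes to fold, which is why shorter cycle times put more pressure on link bandwidth.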

SRDF/A Performance Considerations

  • Correct planning is key: cache, bandwidth, future workload increase, resiliency features (DSE, transmit idle)
  • Long distance networks (see SRDF/A Network Performance Analysis and Troubleshooting)

Three-site Solutions

  • Concurrent SRDF
    • R11 — S –> R2
    • R11 — A –> R2
  • Cascaded SRDF
    • R1 — S –> R21 — A –> R2
  • SRDF/Star for differential from B to C in the concurrent model
  • SRDF/EDP (Extended Distance Protection) through Cascaded SRDF
    • R1 — S –> R21 — A –> R2
    • Production site > Pass-thru Site > Out-of-region Site
    • Here the R21 is diskless, and the site is a pass-thru site
    • No data loss at an out-of-region site at a lower cost, if only the production site goes down.
    • DSE volumes can be put behind the R21
    • The middle site needs to be V-Max
    • SRDF/S has the best response time.
    • SRDF/A gets higher response times as the cycle times shorten.

Virtual Provisioning with SRDF

  • Storage is allocated in extents of 768KB
  • SRDF operations continue to be supported with the granularity of blocks (512 bytes)
  • SRDF supports connecting a VP device to another VP device (thin to thin only)
  • Indirection penalty with every write
    • Noticeable at high I/O rates
    • Consequently more SRDF CPU power is needed to achieve equivalent performance
    • Maximal 2500 IOPS (roughly, still doing testing and engineering)

Control Operations Speed-up

  • New architecture for SRDF meta-data in V-Max
    • Dynamically allocated
    • Continuous in memory
    • Support for bulk operations
    • Therefore all control operations can complete faster
  • Increased scalability
    • Commands are broken into multiple extents, executed in parallel
  • As a result:
    • Faster dynamic SRDF operations
    • Faster failover and failback operations
    • Faster SRDF re-synchronization
    • Therefore lower RTO
  • Speed ups
    • Need V-Max to V-Max, 5874
    • Create a new SRDF pair (symrdf createpair -invalidate) from 54 to 7 seconds
    • Full SRDF establish (symrdf establish -full) from 53 to 4 seconds
    • Incremental establish after split (symrdf establish) from 6 to 3 seconds
    • Failback (symrdf failback) from 47 to 19 seconds
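
To put the timings above in ratio form, here is a quick calculation using only the before and after seconds listed; the `speedup` helper is just for illustration.

```python
# (before, after) in seconds, as listed in the notes above
timings = {
    "createpair -invalidate": (54, 7),
    "establish -full": (53, 4),
    "establish (incremental)": (6, 3),
    "failback": (47, 19),
}


def speedup(before_s, after_s):
    """Ratio of old elapsed time to new elapsed time."""
    return before_s / after_s


for op, (before, after) in timings.items():
    print(f"symrdf {op:<24} {before:>2}s -> {after:>2}s  ({speedup(before, after):.1f}x)")
```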

Achieving the Best Performance from Virtual Provisioning on Symmetrix

May 21, 2009 at 12:32 pm (Information)

TDAT – thin data device; TDATs are the devices that make up a thin pool

TDEV – thin device, which has extents or chunks (allocation unit is 12 tracks, or 768KB)

Enabling technologies

  • TDEV, TDAT, and pools
  • IVTOC (VTOC now doesn’t happen when the bin file is loaded, it’s VTOC’ed when it’s written, so there’s a performance impact for this) (will this be like a COFW penalty?)
  • FE-enabled pre-fetch

Write Flow

  • TDEV has an allocation table, pointers to physical storage
  • Rather than point directly to a track on the TDAT, it points to a different allocation table at the end of the TDAT.
  • The allocation table (id_tables?) in the TDAT then points to the track group (12 tracks) on the TDAT.
  • Capacity is allocated round robin to all the TDATs.
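
The write flow above can be sketched as a small data structure. This is a simplification under stated assumptions: it collapses the two levels of allocation tables into a single lookup, the 64KB track size is inferred from 768KB / 12 tracks, and `ThinPool` and `ThinDevice` are hypothetical names.

```python
TRACK_KB = 64                       # inferred: 768 KB extent / 12 tracks
TRACKS_PER_EXTENT = 12
EXTENT_KB = TRACK_KB * TRACKS_PER_EXTENT


class ThinPool:
    """Hands out extents to thin devices, rotating across the TDATs."""

    def __init__(self, tdat_names):
        self.tdats = list(tdat_names)
        self.next_tdat = 0
        self.used = {name: 0 for name in self.tdats}  # extents consumed per TDAT

    def allocate_extent(self):
        name = self.tdats[self.next_tdat]
        self.next_tdat = (self.next_tdat + 1) % len(self.tdats)  # round robin
        slot = self.used[name]
        self.used[name] += 1
        return (name, slot)


class ThinDevice:
    """TDEV: maps its own extent index to a (TDAT, slot) location on demand."""

    def __init__(self, pool):
        self.pool = pool
        self.table = {}

    def write(self, offset_kb):
        extent = offset_kb // EXTENT_KB
        if extent not in self.table:          # first write into this extent
            self.table[extent] = self.pool.allocate_extent()
        return self.table[extent]


pool = ThinPool(["tdat0", "tdat1", "tdat2"])
dev = ThinDevice(pool)
for offset in (0, 16, 768, 1536, 2304):
    print(offset, "KB ->", dev.write(offset))
```

Writes at 0KB and 16KB share one extent, while each new extent lands on the next TDAT in turn, which is what spreads a thin device across the pool.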

Wide Striping

  • Drives are carved up into TDATs and added to a thin pool. The drives are spread out everywhere across all the DA pairs. Looks like 3PAR chunklets.
  • Spreads workload more evenly across all spindles.
  • No need for Symm Optimizer, not applicable.
  • Some results from lab testing. You’ll notice that more devices increases IOPS. This is because of the increase in the number of available queues for IO.
    • Random read miss for 256x devices on 480x R1 drives got up to 110k IOPS (~229 IOPS per drive). Uniform workload.
    • Random read miss for 480x devices on 480x R1 drives got up to 120k IOPS (~250 IOPS per drive). Uniform workload.
  • Is there an optimal size for a pool? Large is obviously better but size could be dictated by drive types or application workloads.
  • Would you create separate pools for an Exchange database and log?

Thin Data Device Considerations (TDAT)

  • Protection Type
    • What kind of RAID type am I going to use?
    • The pools are a fixed RAID type. The pool will inherit the RAID type of the first TDAT added.
    • OLTP workloads – R1 2 hyper > R5 3+1 4 hyper > R6 14+2 16 hyper
    • DSS workloads – R1 2 hyper > R5 3+1 4 hyper > R6 14+2 16 hyper
    • You may consider R6 over R5 despite the write performance impact on R6 because a double drive failure would significantly impact more volumes, since it will have TDATs for many more thin volumes.
  • Configuration Best Practices
    • Data devices should all reside on the same rotational speed.
    • Data devices should be spread evenly across as many DAs and drives as possible.
    • Data devices should be the same size, if possible. Uneven sizes could result in uneven data distribution.
    • Fewer, larger devices are better than many, tiny devices.
    • Expand pools in large chunks to avoid allocations that use a few TDATs.
  • Expanding Pools
    • Best model would be doubling of the pool size.
    • No mechanism today to balance out the TDATs. Coming but not there yet.

Thin Device Considerations (TDEV)

  • First Write
    • Case 1: Unallocated
      • Allocate extent
      • VTOC track, pad if necessary
      • For random writes, response time goes from 0.5ms to 4.0ms for the first random write. With 16KB writes, 47 other writes will use this extent (remember it’s 768KB)
      • For sequential writes, response time is much better because they utilize the allocated extents. Doing 64KB, 11 out of every 12 writes will ride for free. (0.6ms)
    • Case 2: Pre-allocated
      • VTOC track, pad if necessary
      • When you bind the TDEV, you can pre-allocate the tables to avoid the penalty.
      • For random writes, it looks like the sequential write in case 1. Low IVTOC impact (0.6ms)
      • For sequential writes, it looks the same (0.6ms).
    • Case 3: Pre-written
      • Clear to write
  • Reads
    • Sequential read streams
      • Now data is on multiple physical spindles
      • Pre-fetch mechanism changes in 73 code. It’s now in the front-end FA. Used to be in the back-end DA.
      • The front-end can detect when a sequence is occurring and intelligently issue pre-fetch requests to the respective DA.
      • As long as the ahead buffer is kept full enough, it will minimize seek latency.
  • Meta Volume decisions
    • Concatenated metas gave good sequential read but not great random.
    • Now with TDEVs, concatenated metas are recommended.
      • They are already striped at the pool level.
      • They can be extended while leaving data in place.
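
Going back to the first-write penalty in Case 1, the cost can be amortized over the writes that eventually fill an extent. A rough sketch using the 0.5ms and 4.0ms figures from the notes; it assumes every slot in the extent is eventually written once, and `average_write_ms` is a made-up helper.

```python
EXTENT_KB = 768  # thin allocation unit from the notes


def average_write_ms(io_kb, first_write_ms, later_write_ms):
    """Average response time when one write per extent pays the allocation cost."""
    writes_per_extent = EXTENT_KB // io_kb
    total = first_write_ms + (writes_per_extent - 1) * later_write_ms
    return total / writes_per_extent


# 16 KB random writes: 1 allocating write at 4.0 ms, 47 others at 0.5 ms
print(f"{average_write_ms(16, 4.0, 0.5):.2f} ms average over {EXTENT_KB // 16} writes")
# 64 KB sequential writes: 12 writes per extent, so 11 of 12 avoid allocation
print(EXTENT_KB // 64, "writes per extent at 64 KB")
```

The average looks small, but the host still sees the individual 4.0ms writes, which is the case for pre-allocating when an application is response-time sensitive.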

Replication Considerations

  • Local replication with TimeFinder/Clone. Thin devices will take longer.
    • 4 DA pairs with 480 drives
      • Mirrored thick did 1500 MB/s
      • Mirrored thin did 1100 MB/s
  • Various thin source allocations
    • With less actual allocated data, clone pre-copy times could be faster than thick. This is just because there will be less data to copy to the clone.
  • Remote Replication with SRDF/S
    • Will have higher response time than thick for pre-written TDEVs
    • According to graphs by roughly 30-40%
  • Remote Replication with SRDF/A
    • Pre-written doesn’t see as much overhead.
    • Unallocated still see additional response time.

Best Practice Considerations

  • Always consider disk throughput requirements when creating or growing a data device pool
  • Segregate applications by pools if they won’t play well together
  • Use R1 or R5 (3+1) when write performance is critical. Use R6 for highest availability within a thin pool.
  • Pre-allocate capacity if response sensitive applications expand by randomly writing into new space.
  • Use concatenated meta volumes for ease of expansion
  • Be aware of performance considerations of replication
  • General performance tuning principles still apply to thin devices

FCoE – Considerations for Deploying FCoE

May 21, 2009 at 10:46 am (Information)

Converged Enhanced Ethernet

  • Lossless Ethernet must be FCoE aware
  • FIP snooping

Encapsulation

  • Maximum size frame is 2240 bytes
  • Mini-jumbo frame support is required
  • Not an issue for direct connect environments today
  • 802.1q tag carries CoS information

Data Center Bridging Capability eXchange Protocol (DCBX)

  • Extension of the link layer discovery protocol (LLDP)
  • Allows for exchange of priority map values for both FCoE and the FCoE initialization protocol (FIP)
  • Enables lossless behavior

FIP – FCoE Initialization Protocol

  • Bridges the gap between the expectations of FC and the reality of FCoE
  • Allows initiators and targets to know who to login with
  • One-to-many relationship is built in
  • LKA (link keep alive) and FIP CVL (clear virtual links) allow for logout from fabric should the logical link be lost
  • Implicit security (man-in-the-middle is difficult) provided that FIP snooping and dynamic ACLs are implemented
  • However, FCoE with FIP is more complex and more open for mis-configuration issues
  • Allows enode to:
    • Perform VLAN and FCF discovery
    • Ensure that the layer 2 network is capable of supporting mini-jumbo frames
    • Fabric login
    • LKA protocol
  • FIP VLAN request
    • Multi-cast
    • Allow the enode (CNA) to discover which VLANs FCoE services are being provided
    • All FIP requests and responses use a pre-defined set of TLV (type, length, value) data structures.
  • FIP VLAN notification
    • Unicast
    • Both FCFs respond
  • FIP Solicitation
    • Multicast
    • Allows the enode to discover which FCFs (fibre channel forwarders) are available for login
    • 802.1q tag and max FCoE size field tag
  • FIP Advertisement
    • Unicast
    • Note the priority, name id (fabric WWN), and max FCoE size
    • Max FCoE size is a field padded to the proper size (2158 bytes in length)
    • Dynamic ACL updated via FIP snooping
  • FIP FLOGI
    • Unicast
    • The FCF with the lower priority is sent the FLOGI
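
The FIP exchange above can be summarized as an ordered sequence plus an FCF selection step. This is a sketch of the flow as described in these notes, not an implementation of the protocol; the `Advertisement` fields and fabric names are hypothetical.

```python
from dataclasses import dataclass

# Message order and addressing as listed in the notes above
FIP_SEQUENCE = [
    ("VLAN request", "multicast"),
    ("VLAN notification", "unicast"),
    ("Solicitation", "multicast"),
    ("Advertisement", "unicast"),
    ("FLOGI", "unicast"),
]


@dataclass
class Advertisement:
    fabric_wwn: str      # the "name id" in the advertisement
    priority: int
    max_fcoe_size: int


def choose_fcf(advertisements):
    """Pick the FCF with the lowest priority value, per the notes."""
    return min(advertisements, key=lambda adv: adv.priority)


for step, addressing in FIP_SEQUENCE:
    print(f"{step:<18} {addressing}")

ads = [Advertisement("fabric-a", 128, 2158), Advertisement("fabric-b", 64, 2158)]
print("FLOGI goes to", choose_fcf(ads).fabric_wwn)
```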

Priority Flow Control (PFC)

  • Necessary since FC requires lossless environment
  • Without PFC, normal periodic SAN congestion will cause frames to drop
  • Uses 8 transmission lines with independent transmit queues and receive buffers
  • Maximum distance today is 50 meters. Not a hard limit. It’s a function of the high water mark of the buffer and ?? (missed that detail)

Use cases

  • Already have investment in FC or are building a new DC
  • Already have a LAN that supports 10 GE to core
  • FCFs today are anywhere between 2.5:1 – 5:1 oversubscribed (assumes FCoE switch is fully pop’ed with CNAs, will be solved when FCoE functionality is integrated into director class switches)

Physical Connectivity

  • Twinax (copper) and short range optical (fibre)
    • Twinax is much cheaper than fibre solutions (1/10 the cost)
    • Twinax uses less power

Clariion: Using Navisphere Analyzer to Optimize Performance for SATA, FC, and EFD

May 20, 2009 at 10:26 am (Information)

Periodic data gathering (for low system impact) so most values are averages (default archive interval is 2 minutes)

Some statistics are implied by I/O rates

  • Response times are derived and can be spoofed by workloads
  • Measurement would be performance prohibitive

Using the Survey Charts

  • Customize and add the advanced characteristics in the General tab
  • Set up Survey Charts (70% utilization, 100ms response time, 12 average queue length, leave interval at 0 to see when it exceeds thresholds at any time)
  • Setting the threshold will cause those that exceed the threshold to be yellow/red boxes
  • Double-click the lun, and it will open the detailed performance view of the lun
  • SATA candidates (all-green boxes with flat lines are a start)

Storage Processors

  • Properties page
    • Cache allocation (200MB read cache is most you might need)
    • Is cache on?
    • Cache page size
    • ABQL (average busy queue length) – FLARE 26+
  • Utilization <70% in general, peaks ok
  • EFD can be added if <70%
  • What if SP utilization is >70%?
    • Random writes – convert R6 to R5/R10
    • Sequential writes – check stripe optimizations, use R3 or R5
    • Consider bypassing cache for large I/O
    • Random reads – upgrade to bigger controller
    • Big configurations – go to CX4
  • Dirty pages
    • Exceeding high watermarks consistently? Saturating cache?
      • Convert SATA to FC
      • Convert FC to EFD
      • Cache bypass on large I/O
      • R10 on small I/O

Host Ports (front-end ports)

  • 360 MB/s is the practical max for FC
  • 100 MB/s is the practical max for iSCSI
  • For best results, zone your EFD luns to ports not near the maximum

LUN

  • Properties page
    • Check ownership for trespass state
    • Cache hit ratio – the best EFD candidates will have low read cache hit ratios
    • SATA LUN with low RCHR may benefit from FC drives, as FC are much better at random reads
  • Click on the lun in the LUN tab, then click on the RAID group tab, it will open that RG to show which other luns are on this RG
    • Groups with a lot of luns will see very random access, not good for high-speed sequential
  • Critical lun stats
    • Utilization
      • Simply the percentage of time the lun is busy
      • Busy means at least 1 I/O, so need to be careful when looking at utilization
    • Response time
      • Directly affects the user
      • Queue length affects this directly
      • Cache response time – affected by SP utilization and dirty state
      • Disk read response time
        • FC read service time 5 ms
        • EFD read service time 0.1ms
      • I/O mix
    • Average busy queue length
  • Right-click lun to get I/O distribution summary
    • EFD drives good for small I/O
    • SATA drives good for large I/O
  • Read cache hit ratio (RCHR)
    • Low read cache hit ratio favors EFD
    • High read cache hit ratio favors SATA
    • When the lun is busy, look at the ratio to see where it should reside
    • Then look at the throughput

Disk IOPS per drive (rules of thumb for conservative planning, not maximums)

  • 80 for SATA (large block, sequential, small disk seek distances)
  • 140-180 for FC (small I/O)
  • 2500 for EFD (conservative) (requirements for high IOPS or to keep service times down)

Disks

  • Coalescing is important for good bandwidth, especially SATA (putting small IO together into large IO, typically for sequential write IO)
  • Look at the read/write size for the lun and the disk itself. If the lines match, there is no coalescing. If the read/write size for the disk is higher, then you do have coalescing.
  • Signs of sequentiality
    • High pre-fetch used and RCHR
    • Full stripe writes and stripe size (only relevant for parity RAID)
      • FSW * strip write size = sequential write bandwidth
      • Each disk 256KB * 4 = 4 1 MB/s
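
The coalescing check described above (compare the LUN I/O size with the disk I/O size) boils down to a ratio. A minimal sketch; the 5% tolerance is an arbitrary assumption to absorb averaging noise, and both function names are made up.

```python
def coalescing_ratio(lun_io_kb, disk_io_kb):
    """How many LUN-level I/Os are being merged into each disk I/O, on average."""
    return disk_io_kb / lun_io_kb


def is_coalescing(lun_io_kb, disk_io_kb, tolerance=0.05):
    """True when the disk I/O size is clearly larger than the LUN I/O size."""
    return coalescing_ratio(lun_io_kb, disk_io_kb) > 1 + tolerance


print(coalescing_ratio(8, 32), is_coalescing(8, 32))   # 8 KB at the LUN, 32 KB at the disk
print(coalescing_ratio(8, 8), is_coalescing(8, 8))     # lines match: no coalescing
```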

Primary Criteria for selecting EFD

  • LUN or disk utilization >70%
  • LUN ABQL >12
  • Response time >10ms
  • Read/write ratio 60% or better for reads (better if 80%)
  • I/O < 32KB

Secondary Criteria for selecting EFD

  • Read cache hit ratio <50%
  • Force flushes consistently >10 (busy cache, circumvent cache for EFD)
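
The primary and secondary criteria can be turned into a simple checklist over Analyzer statistics. This is only a sketch: the notes do not say how many criteria must be met, so the function just reports which ones are, and `LunStats` with its field names is hypothetical rather than an Analyzer export format.

```python
from dataclasses import dataclass


@dataclass
class LunStats:
    utilization_pct: float
    abql: float                 # average busy queue length
    response_ms: float
    read_pct: float             # reads as a percentage of all I/O
    io_kb: float                # typical I/O size
    read_cache_hit_pct: float
    forced_flushes: float


def efd_criteria(s):
    """Return (primary, secondary) dicts of criterion -> met?"""
    primary = {
        "utilization > 70%": s.utilization_pct > 70,
        "ABQL > 12": s.abql > 12,
        "response time > 10 ms": s.response_ms > 10,
        "reads >= 60%": s.read_pct >= 60,
        "I/O < 32 KB": s.io_kb < 32,
    }
    secondary = {
        "read cache hit ratio < 50%": s.read_cache_hit_pct < 50,
        "forced flushes > 10": s.forced_flushes > 10,
    }
    return primary, secondary


lun = LunStats(85, 15, 22, 80, 8, 20, 3)   # made-up sample values
primary, secondary = efd_criteria(lun)
print(sum(primary.values()), "of", len(primary), "primary criteria met")
print(sum(secondary.values()), "of", len(secondary), "secondary criteria met")
```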

Clariion Enterprise Flash Drives: Achieve Highest Performance

May 20, 2009 at 8:47 am (Information)

Missed the first 10-15 minutes of the session. Notes begin thereafter.

Lab Test Scenario

R5 4+1 73GB EFD and 15k

SP cache off for EFD (default), on for FC luns

Random multi-threaded 8KB test workloads

IO Type             FC response time (ms)   FC IOPS   EFD response time (ms)   EFD IOPS   IOPS per drive
Read                22                      1450      1.1                      30000      6000
Write               42                      590       6.1                      15000      3000
Read/Write, 60/40   25                      1000      3.0                      7500       1500

Note: The EFD IOPS numbers are roughly accurate from what he had. I didn’t have enough time to get them all.

Bandwidth is not the strongest suit for EFD

FC drives very effective at low stream counts, with SP caching

EFD beats FC when multiple sequential streams are running, without SP caching

2,500 IOPS per drive, 100 MB/s per drive

Compare to 180 IOPS for 15k

For simple estimation for all EFD drives, mixed read/write workloads

Remember this is a backend measure. Need to add RAID parity penalty calculations
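
Here is a sketch of the parity penalty calculation the last point refers to. The write penalties (2 for R10, 4 for R5, 6 for R6) are the usual rules of thumb for small random writes, not numbers from the session, and the function names are made up. Per-drive figures are the 2,500 and 180 IOPS quoted above.

```python
import math

WRITE_PENALTY = {"R10": 2, "R5": 4, "R6": 6}   # assumed rule-of-thumb back-end I/Os per host write
EFD_IOPS_PER_DRIVE = 2500
FC_15K_IOPS_PER_DRIVE = 180


def backend_iops(host_iops, read_fraction, raid):
    """Translate host IOPS into back-end disk IOPS for a given RAID type."""
    reads = host_iops * read_fraction
    writes = host_iops * (1 - read_fraction)
    return reads + writes * WRITE_PENALTY[raid]


def drives_needed(host_iops, read_fraction, raid, iops_per_drive):
    """Drive count by IOPS alone; ignores capacity and RAID group width."""
    return math.ceil(backend_iops(host_iops, read_fraction, raid) / iops_per_drive)


load = backend_iops(10000, 0.6, "R5")
print(f"{load:.0f} back-end IOPS")
print("EFD drives:", drives_needed(10000, 0.6, "R5", EFD_IOPS_PER_DRIVE))
print("15k FC drives:", drives_needed(10000, 0.6, "R5", FC_15K_IOPS_PER_DRIVE))
```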

System Limitations

  • Ports – FC connections, 4 Gb/s bandwidth, 75000 IOPS
  • SP – CPU processing rate, memory system bandwidth
  • Buses – FC loops, 4 Gb/s bandwidth, 75000 IOPS
  • Limits can be reached with large number of HDDs or small number of EFDs

Configuring Cache

Default – cache is off for EFD luns. Not necessarily a best practice, just the default

Why?

  • Uncached EFD performance is excellent
  • Allocate cache pages to HDD luns that benefit more

When to change the default:

  • Sequential reads
    • Creates multiple threads from single thread load
    • Pre-fetch multipliers can optimize for EFDs with more smaller requests
  • Response time critical writes
    • SP cache can hide RAID parity overhead
    • Helps match fast EFD read times with slower write times
  • EFD-only array

Configuring Layout

  • All RAID groups available
    • Per GB cost and low service times favor R5
    • Narrow R10 good option for small deployments, best write response time with cache off
    • Best practices still apply
  • LUNs
    • High concurrency gives highest EFD throughput
    • Multiple LUNs on RG
    • Dual SP ownership
  • Bandwidth
    • Spread across buses in cases where bandwidth is required

Virtual Provisioning

  • Pool management reduces maximum performance
  • Policy based data placement means more performance variance
  • Makes sense if highest EFD performance is not required and provisioning convenience is more of a priority

EFD-friendly Workloads

  • High ratio of reads, high percentage random reads
    • Reads faster than writes
    • Large or small block are both winners
  • Requirement for very low latency
  • High concurrency
    • Needs many threads to obtain highest throughput

Replication Advantage

  • COFW IO operations on the source LUN contends with all other traffic and queues on source drives
  • Use EFD for source lun and HDD reserved lun pool. The read performance for COFW will improve source host response time.

Clariion: Performance and Design Implications of Mixed Workloads

May 19, 2009 at 3:38 pm (Information)

RBA – Ring Buffer Analyzer, internal analysis tool

TCD – target command device (front-end ports of the array)

LBA distribution – locality of reference

Battle of the RAID Group

Default segment/pre-fetch multiplier is 4.

LUN migration should be run with 511 write-aside on the target to avoid cache overrun. 40MB/s will run through the SP, and SP utilization will be moderate. This restriction is lifted in Release 29. It will default LUN migration to write-aside.

The duration of read starvation depends on the amount of cache to de-stage to the low watermark and the state of the LRU queue.

In the larger CX4 arrays with more write cache, you may want to change the high and low watermarks to reduce the effect of read starvation. (60-65%)

This read starvation could have a significant effect on Jetstress testing.

The default watermarks (60-80%) are not necessarily needed. There is nothing wrong with low and close watermarks. The most important cache consideration is having write cache available for a large bursty write, even if that means de-staging cache more frequently. However, if you’re getting high hit rates, then you may want to keep more data in cache.

Idle delay timer (2 seconds) will kick off a de-stage activity. Avoid “lazy writers” by setting the idle delay to 200 (20 seconds) with:

naviseccli -chglun -l <lun> -t 200

Release 29 will default to 20 seconds.

  1. Write to a lun, written to cache
  2. The “clean-dirty” bit must be set dirty on the drive to flag presence of vault content after an SP failure
  3. After the state bit is set, the write can be ack’ed.
  4. After 2 seconds, idle flushers clear this LUN’s cache entries
  5. Clean-dirty is set to clean
  6. A new write comes in. We need to set “dirty” again
  7. If the disk is really busy, we will be very slow returning the ack.

This activity is seen more commonly now because of VMware and clusters. You can see this if you run a perfmon.

Battle of the SP

Not supposed to mix OLTP and DSS because they have completely different I/O patterns. However, on larger CX4 arrays, you can dedicate buses to different patterns.

Rules of Thumb

Multiplier (CPUM)

CX4-960 – 1.00

CX4-480 – 0.65

CX4-240 – 0.55

CX4-120 – 0.30

A – CPUM x 50k reads/s standard lun

B – CPUM x 16k write/s R5

C – CPUM x 20k writes/s R10

D – CPUM x 40k reads/s, Snaps, MV/s, clone source

E – CPUM x 7.5k writes/s MV/s

F – CPUM x 6k writes/s, clone-in-sync

G – CPUM x 2.5k writes/s, Snap COFW

H – CPUM x 6k writes/s, Snap non-COFW

Data logging % = Number of LUNs / Max LUNs * 10%

One SP’s utilization will be the sum of the proportional contributions of each I/O type
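
The rules of thumb above can be combined into a rough SP utilization estimate. A sketch under stated assumptions: each I/O type contributes its rate divided by its CPUM-scaled ceiling, data logging adds up to 10%, and the LUN counts in the example are made up. `sp_utilization` is a hypothetical helper, not an EMC tool.

```python
CPUM = {"CX4-960": 1.00, "CX4-480": 0.65, "CX4-240": 0.55, "CX4-120": 0.30}

# Ceiling per I/O type (operations per second at CPUM = 1.00), letters as in the notes
MAX_RATE = {
    "A": 50000,  # reads, standard LUN
    "B": 16000,  # writes, R5
    "C": 20000,  # writes, R10
    "D": 40000,  # reads, Snaps, MV/S, clone source
    "E": 7500,   # writes, MV/S
    "F": 6000,   # writes, clone-in-sync
    "G": 2500,   # writes, Snap COFW
    "H": 6000,   # writes, Snap non-COFW
}


def sp_utilization(model, workload, luns=0, max_luns=1):
    """Estimated utilization (0.0-1.0+) of one SP.

    workload: dict of I/O type letter -> operations per second on this SP.
    """
    cpum = CPUM[model]
    io_part = sum(rate / (cpum * MAX_RATE[kind]) for kind, rate in workload.items())
    logging_part = (luns / max_luns) * 0.10
    return io_part + logging_part


# Example: 6,500 standard reads/s plus 2,080 R5 writes/s on a CX4-480,
# with half of a hypothetical 1,024-LUN maximum configured
util = sp_utilization("CX4-480", {"A": 6500, "B": 2080}, luns=512, max_luns=1024)
print(f"{util:.0%}")
```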

The memory subsystem is doing 3000 MB/s for 600 MB/s of host I/O.

Cycles of execution are tied not only to the I/O but also memory subsystem operations.

  • Host IO (600 MB/s)
  • De-stage to disk (600 MB/s)
  • CPU xor (600 MB/s)
  • Sending data to peer SP (600 MB/s)
  • Receive data from peer SP (600 MB/s)

So now for 600 MB/s of activity, we see 3000 MB/s.

Note in NDU failover, polling and data logging do not failover.

Battle of Large Block and Small Block

Large block I/O on the disk can be very disruptive to small block I/O.

8KB random read at 5.5ms for 180 IOPS

512KB random read at 15ms for 66 IOPS

In a single thread to the same drive, they will both get 1000/(5.5+15) = 48.78. Now both activities are getting 48 IOPS. Big decrease for the small block I/O.
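
The arithmetic above generalizes to any mix of service times sharing one thread. A small sketch, with `single_thread_iops` as a made-up helper: when one thread alternates between I/O types, each type completes once per combined service time.

```python
def single_thread_iops(service_times_ms):
    """IOPS each stream gets when one thread cycles through these service times."""
    return 1000 / sum(service_times_ms)


print(round(single_thread_iops([5.5])))           # 8 KB reads alone
print(round(single_thread_iops([15])))            # 512 KB reads alone
print(round(single_thread_iops([5.5, 15]), 2))    # each stream when mixed
```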

You know you have this mixture if you have high write response times but the dirty cache pages are not at 100%.

Check your HBA queue depths settings and path distribution.

Maintaining High NAS Performance Levels: Celerra Performance Analysis and Troubleshooting

May 19, 2009 at 1:06 pm (Information)

Most performance concerns can be summarized by 4 questions:

  1. What am I getting?
  2. Is that what I should be getting?
  3. If not, why not?
  4. What, if anything, can I do about it?

Characterize workload in terms of

  • IOPS
  • Size (KB/IO)
  • Direction (read/write)

Performance triage domains

  • Host
  • IP network
  • Data mover
  • Fibre channel
  • Storage processors
  • More fibre channel
  • Disk drives

Celerra Volume Stack

Filesystem

Meta volume

Slice volume

Stripe volume

Basic volume (dvols)

Identify the protocol(s) in use

server_stats server_5 -summary cifs,nfs -interval 10 -count 6

server_stats server_5 -summary nfs -interval 10 -count 6

server_stats server_5 -table nfs -interval 10 -count 6

  • Operations will break down between v3Write, v3Create, etc.

server_stats server_5 -table fsvol -interval 10 -count 6

  • Correlates the filesystem with the meta-volumes
  • The percentage contribution of write requests for each meta-volume is shown (“FS Write Reqs %”)

server_stats server_5 -table dvol -interval 10 -count 6

  • Shows the write distribution across all volumes
  • AVM will work hard to prevent disk overlap for a filesystem
  • Slice your stripes, don’t stripe your slices (basically create the stripe across all volumes first, then slice those up as needed)
  • root_ldisk – log disk, high activity on this disk will mean lots of log activity in the server_log. The ufslog hit high threshold. But is it a problem?
    • Data mover memory includes inodes and data blocks
    • Data mover cache is write-through, meaning that data needs to be destaged from cache before it will acknowledge to the host. This is because the cache is not protected from power loss.
    • When writes are coming in, data blocks are updated, and inodes need to be updated.
    • Inode updates are writing to the ufslog staging buffer.
    • The staging buffer contains uxfs log transactions and then destages to disk.
    • Ufslog hit high threshold means that in-memory copies of uxfs log transactions which have already been written to disk could not be retired because the dirty metadata to which they point has not yet been flushed to the filesystem metavolume
    • This message indicates contention at the filesystem metavolume, not the ufslog volume.
    • If ufslog is an issue, the error message will be “staging buffer full, using next one”. An occasional one is not an issue; it’s actually good that the buffer is being used. It is only a concern if you get a lot of these per second.

nas_disk -l | grep root_ldisk

APM000123456789-0001

navicli -h spa getlun 1 -rwr -brw -wch (read write rate, blocks read/written, write cache hit)

For IOPS, 8 threads of I/O will yield the greatest increases. 8 to 64 threads yields nominal improvement.

nas_fs -i fs1

nas_disk -i d38

  • Look at stor_dev (hex) and convert to decimal
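
The hex-to-decimal step is a one-liner. The stor_dev value "001B" below is a made-up example (chosen so that it maps to LUN 27, the LUN used in the next command), not output captured from the session.

```python
def stor_dev_to_lun(stor_dev_hex):
    """Convert the hex stor_dev field from nas_disk output to a decimal LUN number."""
    return int(stor_dev_hex, 16)


print(stor_dev_to_lun("001B"))
```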

navicli -h spa getlun 27 -rwr -brw -wch

  • Blocks written / write requests = blocks per write (multiply by 512 bytes to get block size)

navicli -h spa getlun 27 -rwr -brw -wch -disk

  • Shows the disks associated with the lun

navicli -h spa getdisk 2_0_2 -rds -wrts -bytrd -bytwrt (read reqs, write reqs, kbyte read, kbytes written)

  • Kbytes written / write requests
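
Both size calculations above are simple divisions over the counters. A sketch with made-up counter values; the helper names are hypothetical, and the counters should come from two samples over the same interval.

```python
BLOCK_BYTES = 512


def lun_write_size_kb(blocks_written, write_requests):
    """Average LUN write size from getlun counters (blocks are 512 bytes)."""
    return blocks_written * BLOCK_BYTES / write_requests / 1024


def disk_write_size_kb(kbytes_written, write_requests):
    """Average disk write size from getdisk counters."""
    return kbytes_written / write_requests


print(lun_write_size_kb(160000, 10000), "KB per LUN write")    # hypothetical counters
print(disk_write_size_kb(80000, 2500), "KB per disk write")    # hypothetical counters
```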

Putting it all together, we saw:

Nfs write size 8KB

Dvol write size 8KB

Lun write size 8KB

Disk write size 32KB

Go to the host, and check the filesystem:

df

grep fs1 /etc/mtab

Check the rsize=8192,wsize=8192 settings. These are the buffer size limits. Even if the application wanted to write 32KB, the buffer is limited to 8KB. Update those settings to 32768. You’d need to unmount and remount with the new settings. Needs to be coordinated since it will be disruptive.

server_ifconfig server_5 -all

server_stats server_5 -table net -interval 10 -count 6

  • Network in (KiB/s) / Network In (Pkts/s) to figure out the packet size
  • Do this for In and Out to see what the standard MTU size is
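
The packet size estimate is throughput divided by packet rate. A minimal sketch with made-up sample values; `avg_packet_bytes` is a hypothetical helper.

```python
def avg_packet_bytes(kib_per_sec, pkts_per_sec):
    """Average packet size in bytes from server_stats network counters."""
    return kib_per_sec * 1024 / pkts_per_sec


# Hypothetical sample: 14648.4375 KiB/s at 10,000 packets/s
print(avg_packet_bytes(14648.4375, 10000))
```

An average near 1500 bytes suggests a standard MTU with full-size packets; a much larger average points to jumbo frames being in use.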

server_netstat server_2 -s -p tcp

  • Look for transmission errors (retransmissions)
  • A node is aware only of its own retransmissions so be sure to check both ends of the connection

Navisphere Analyzer Command Line Interface is a good reference for looking at data.

You can extract specifically what you want as a CSV file.

naviseccli analyzer -archivedump -data spa.nar -stime "…" -ftime "…" -object l -format pt,on,rio,rs,wio,ws | grep _d38

Understanding New Features of Enginuity 5874 for Symmetrix V-Max

May 19, 2009 at 1:05 pm (Information)

Base/Infrastructure Enhancements

  • RAID Virtual Architecture (RVA)
    • New RAID architecture internal to the Symmetrix
    • No longer requires multiple mirror positions for RAID protection
    • Unchanged supported types (R1, R5 3+1 and 7+1, R6 6+2 and 14+2)
    • All features use a common RAID engine
    • Enabling feature for Virtual LUN migration
    • Only requires 1 mirror position on device and opens up the remaining 3 mirror positions for other needs
  • Large volume support
    • Can support volume sizes up to 240GB (60GB on DMX4)
    • May eliminate the need for meta devices or reduce the number of members
    • Supports Mainframe EAV (extended address volumes)
    • Reduces the likelihood of encountering system addressing limits
  • 512 hypers per physical drive
    • Previously 255 max
    • Allows customers to better utilize large drives like 1TB SATA

Management and Provisioning Features

  • Auto-provisioning Groups
    • Creates storage groups which contain Symmetrix devices
    • Port groups which contain FA ports
    • Initiator groups which contain host HBAs
    • Create a view which associates these
    • Mapping and masking happens through the views now
    • Host will still need to rescan to see the storage
  • Dynamic Provisioning Enhancements
    • Introduced in 5773
    • Ability to make configuration changes like mapping without requiring a configuration file update (bin file)
    • In 5874, extended to include other device attributes like BCV, Dynamic SRDF, etc.
  • Concurrent Provisioning
    • Concurrent configuration changes (mapping and unmapping in parallel)
  • Management can be done over IP network
  • SMC can run on the Service Processor, which is on the customer’s internal network, out-of-band (FC)
  • No longer needs to be connected to the DMX via FC. Can still do it that way but not necessary
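
The Auto-provisioning Groups model above (three groups tied together by a view) can be sketched as a data structure. This only illustrates the relationships; the class, device IDs, port names, and WWN are all made up and this is not the Solutions Enabler API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MaskingView:
    name: str
    storage_group: frozenset     # Symmetrix device IDs
    port_group: frozenset        # FA ports
    initiator_group: frozenset   # host HBA WWNs

    def paths(self):
        """Every (HBA, FA port, device) combination the view exposes."""
        return {
            (hba, port, dev)
            for hba in self.initiator_group
            for port in self.port_group
            for dev in self.storage_group
        }


view = MaskingView(
    name="host1_view",
    storage_group=frozenset({"0A10", "0A11"}),
    port_group=frozenset({"FA-7E:0", "FA-8E:0"}),
    initiator_group=frozenset({"10000000c9aaaaaa"}),
)
print(len(view.paths()), "paths mapped and masked by one view")
```

Adding a device to the storage group or a port to the port group changes what the view exposes without touching the other two groups, which is the convenience the feature is after.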

SRDF Enhancements

  • 2-Mirror SRDF consistency support
    • Previously SRDF/S/S with both legs in the same consistency group
    • Now you can put each leg in a separate consistency group so they can be changed separately
  • 250 SRDF groups
    • Previously 128 groups
    • Up to 64 SRDF groups are now allowed per FC or GigE SRDF port (previously 32)
  • SRDF/Extended Distance Protection (SRDF/EDP)
    • New long distance replication solutions
      • SRDF/S (from primary to secondary) to SRDF/A (from secondary to tertiary)
    • Diskless R21 pass-through device. No back-end physical disk requirement
    • Diskless R21 device is not host-accessible
    • Ability to achieve zero RPO at tertiary site
    • Only the secondary needs to be V-Max on 5874
    • Will need to add cache to support the diskless R21 devices
  • R22 dual secondary device
    • New device that can act as the target device for two distinct R1 devices.
    • Accepts R/W from only one of the source devices
    • This is used for SRDF/STAR situations
    • Improves the RTO if the primary site is lost
  • Add/remove devices to/from an active SRDF session
    • Prior to this release, adding/removing a device would mark the whole group invalid.
    • Now, once a newly added device becomes fully synchronized, it is added to the group consistency
  • SRDF Invalid Track Handling
    • Previous method of searching for invalid tracks (tracks owed) has been redesigned
    • Increases speed of SRDF control operations, like createpair (10x faster, measured)
    • Increases speed of establish
    • Improves the search for the last few tracks, known as the “long tail”
  • Concurrent TimeFinder/Clone and SRDF restore
    • Previously needed to do clone restore to R2, then restore R2 to R1
    • Can now happen at the same time, for both native clone and clone emulation

TimeFinder Enhancements

  • Target device of a clone session can be used as the source for another clone session without terminating the initial clone session
    • Source A > Clone B > Clone C
    • Supported in native clone and clone emulation
  • Performance improvements
    • Up to 250-300% increase in copy rates for R5/R6
    • Reduction of cache consumption during clone copy

Open Replicator Enhancements

  • Support for TimeFinder/Snap virtual devices (VDEVs)
    • Open Replicator push sessions can be created using VDEVs as the control devices
    • No longer requires full-copy clones or BCVs
  • Open Replicator support for IBM System i (iSeries) devices

Virtualization Enhancements

  • Enhanced Virtual LUN technology
    • Enables transparent, non-disruptive data mobility between storage tiers for standard Symmetrix volumes
    • Built on the new RAID Virtual Architecture (RVA)
    • Move between different drive spindles or change RAID protection type
    • Supports migration of meta devices
    • Works with all replication technologies
      • SRDF, TimeFinder/Snap, TimeFinder/Clone, Open Replicator
      • Active replication being migrated stays intact
      • Incremental relationships maintained
      • Allows customers to meet federal DR requirements
    • Migration can utilize configured or unconfigured disk space as target
  • Virtual Provisioning
    • Thin provisioning (available in previous releases)
    • Support to drain physical devices to shrink storage pools
    • Create 4x larger volumes
    • Virtually provision all tiers and RAID levels


Symmetrix V-Max Series with Enginuity Overview

May 19, 2009 at 12:57 pm (Information)

Time to update the blog. I’ll post my notes from EMC World 2009. Here’s the first of the series.

V-Max SE (single engine)

Matrix Interface Board Enclosure (MIBE)

Scaling up – growing from a small configuration to a large configuration

Scaling out – growing out beyond the physical backplane

V-Max Engine

  • Host and disk IO ports
  • Multi-core storage CPUs
  • Mirrored global memory
  • V-Max interconnects
  • Enginuity storage OS

The global memory is distributed across all V-Max engines.

V-Max system

  • Scales 1-8 engines
  • 2-16 director boards
  • 128 FC FE ports
  • 64 FICON
  • 64 GigE/iSCSI
  • 1TB memory
  • 10 storage bays
  • 96-2400 drives
  • Engines go from 1 to 8 from bottom to top
  • Director boards go from 1 to 16 from bottom to top
    • Engines are inside out
    • 7-8 are the first two directors for engine 4
    • 16-15 are engine 8?
    • 1-2 are engine 1?

V-Max system – one engine (120 drives)

  • 2 director boards
  • 16 FC FE
  • 8 FICON
  • 8 GigE
  • 128GB memory
  • 120 drives
  • Expansion bay to grow to 360 drives
  • V-Max engine 4 is the only engine
  • Director boards 7-8
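
The system maximums listed above line up with the single-engine numbers multiplied by the engine count. A quick Python sketch of that arithmetic, using only the figures from these notes (the function and dictionary names are mine):

```python
# Per-engine maximums, taken from the single-engine list above.
PER_ENGINE = {
    "director_boards": 2,
    "fc_fe_ports": 16,
    "ficon_ports": 8,
    "gige_ports": 8,
    "memory_gb": 128,
}
MAX_ENGINES = 8


def system_maximums(engines):
    """Scale the per-engine maximums to a system with `engines` engines."""
    if not 1 <= engines <= MAX_ENGINES:
        raise ValueError(f"engines must be between 1 and {MAX_ENGINES}")
    return {name: value * engines for name, value in PER_ENGINE.items()}


for name, value in system_maximums(8).items():
    print(f"{name}: {value}")
```

Drive counts are left out on purpose, since they depend on how the storage bays are cabled rather than on a simple per-engine multiple.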

V-Max Engine

  • 2 director boards
  • Redundant power supplies
  • Redundant battery backup
  • Redundant management modules
  • Front-end connections for hosts
  • Back-end connections to drives
  • Multi-core CPU
  • Global memory
  • Redundant connections to the V-Matrix interconnect
  • Back-end
    • Contains 4 back-end I/O modules and supports redundant direct connection to 8 drive enclosures
    • Supports dual initiator failover across directors
    • Supported drives
      • 200GB/400GB EFD
      • 146GB/300GB/450GB 15k
      • 300GB 10k
      • 1TB SATA
  • Front-end
    • FC 2/4 Gb for host and SRDF
    • FICON 2/4 Gb for mainframe
    • iSCSI/GigE 1 Gb for host and SRDF
    • iSCSI/GigE also has hardware compression
  • Memory
    • 32GB
    • 64GB
    • 128GB
    • Upgrade within an engine or add another V-Max engine
    • Memory is mirrored across engines through the V-Max interconnect
  • 4 cooling modules and redundant power supplies
  • 2 director boards in each engine, one even and one odd (odd is lower board, even is higher board)
  • FA ports (odd ports are on the right, even ports are on the left)
  • System is designed to tolerate a single engine failure (when there is more than one engine)
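
The director numbering can be sketched in a couple of lines of Python. This assumes the numbering is contiguous, so that engine n holds directors 2n-1 (odd, lower board) and 2n (even, higher board). That matches the engine 4 = directors 7-8 note, but the engine 1 and engine 8 mappings were question marks in my notes, so treat this as an assumption. The function names are mine.

```python
def directors_for_engine(engine):
    """Return (odd_director, even_director) for an engine, assuming contiguous numbering."""
    if not 1 <= engine <= 8:
        raise ValueError("engine must be between 1 and 8")
    odd = 2 * engine - 1  # lower board
    return odd, odd + 1   # higher board is the even one


def engine_for_director(director):
    """Inverse mapping: which engine a director board sits in."""
    if not 1 <= director <= 16:
        raise ValueError("director must be between 1 and 16")
    return (director + 1) // 2


for engine in (1, 4, 8):
    print(engine, directors_for_engine(engine))
```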

V-Max Matrix Interface Board Enclosure (MIBE, pronounced mi-bee)

  • Supports up to 16 director boards
  • 1U, each sits side by side
  • A fabric and B fabric to redundantly connect all engines together

Configurations

  • Backend cards and V-Max cards are not configurable
  • IO modules are configurable
    • All FC
    • All FICON
    • All iSCSI
    • FC and FICON (FC, EF, FC, EF)
    • FC and GE (FC, GE, FC, GE)
    • EF and GE (EF, GE, EF, GE)
    • Memory (32GB, 64GB, 128GB)
  • Direct connect to 8 DAEs
  • Daisy chain to 8 more DAEs (up to 2 DAEs chained for 3 total DAEs per loop, hence 3 * 8 * 15 = 360 drives)
  • Base Configuration
    • 8 DAEs in the system bay
  • 1st Daisy Chain
    • 8 DAEs in the expansion storage bay (4 on the bottom row Q1, 4 on bottom row Q2)
  • 2nd Daisy chain
    • 8 DAEs in the expansion storage bay (4 on the top row Q1, 4 on the top row Q2)
  • First 4 engines support two direct connect drive bays (480 drives)
  • 1 level of daisy chain drive bays can be added to these engines (960 drives)
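
The drive counts above all come from the same multiplication: engines, times 8 direct-connect DAEs per engine, times DAEs per loop, times 15 drives per DAE (the 15 comes from the 3 * 8 * 15 figure). A small Python sketch, with names of my own choosing:

```python
DRIVES_PER_DAE = 15          # from the 3 * 8 * 15 = 360 figure above
DIRECT_DAES_PER_ENGINE = 8   # each engine direct connects to 8 DAEs


def max_drives(engines, daisy_chain_levels=0):
    """Drive count for `engines` engines with 0, 1, or 2 daisy-chained DAEs per loop."""
    if engines < 1:
        raise ValueError("need at least one engine")
    if daisy_chain_levels not in (0, 1, 2):
        raise ValueError("daisy_chain_levels must be 0, 1, or 2")
    daes_per_loop = 1 + daisy_chain_levels
    return engines * DIRECT_DAES_PER_ENGINE * daes_per_loop * DRIVES_PER_DAE


print(max_drives(1))     # single engine, direct connect only
print(max_drives(1, 2))  # single engine with both daisy chain levels
print(max_drives(4))     # first 4 engines, direct connect only
print(max_drives(4, 1))  # first 4 engines with 1 level of daisy chain
```

This reproduces the 120, 360, 480, and 960 figures from the notes. It does not try to model which engines are allowed which daisy chain depth.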

Left side: 7 (Daisy), 7 (Daisy), 7 (Direct), 5 (Daisy), 5 (Direct), 2 (Daisy), 2 (Daisy), 2 (Direct), 4 (Daisy), 4 (Direct)

System bay (middle): 8, 7, 6, 5, 4, 3, 2, 1

Right side: 6 (Daisy), 6 (Daisy), 8 (Direct), 8 (Daisy), 8 (Direct), 3 (Daisy), 3 (Daisy), 1 (Direct), 1 (Daisy), 1 (Direct)

Direct connect needs to be close

Daisy chain bays can be geographically dispersed

Director board

  • Two quad-core CPUs (one for disk, one for host)
    • H,G,F,E (front-end, host)
    • D,C,B,A (back-end, disk)

Questions

What’s the speed of the V-Max Interconnect?

What’s the maximum distance for dispersion? 73 inches for DMX, 146 inches for V-Max (you can disperse any/both daisy chain bays, and it does require an RPQ).


SSD Performance

January 13, 2009 at 6:28 pm (Performance)

We’re doing some performance testing with a customer right now using solid state drives (SSD). The customer currently has 9x 146GB drives configured as R5 8+1. Marketing numbers will typically state 15x to 30x the IOPS of 15k Fibre Channel drives. We all know real-world results will typically be lower. Below is a table of tests run with Iometer.

| Profile | IOPS | MBps | Response | CPU | Drive Count | IOPS per Drive | MBps per Drive | IOPS Comparison to 15k |
|---|---|---|---|---|---|---|---|---|
| 2KB and 4KB blocks 50/50 split – sequential | 10973 | 32 | 185 | 6.4 | 9 | 1219.22 | - | 6.77 |
| 2KB and 4KB blocks 50/50 split – random | 13996 | 31 | 145 | 8.6 | 9 | 1555.11 | - | 8.64 |
| 1M blocks sequential | 264 | 264 | 5267 | 1 | 9 | - | 29.33 | - |
| 10MB blocks sequential | 28.15 | 281 | 38911 | 1 | 9 | - | 31.22 | - |

From an IOPS perspective, we got roughly a 6.8x to 8.6x improvement, assuming 180 IOPS per 15k drive. While it’s not 15x or 30x, it’s still impressive. You’ll note that the response time increases drastically for the larger block I/O and that CPU utilization climbs with the smaller block I/O, as you’d expect.

The numbers say it all.
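
For anyone who wants to redo the comparison with a different baseline, the per-drive and comparison columns are just two divisions. A short Python sketch using the small-block results from the table; the 180 IOPS baseline is the assumption stated above, and the variable and function names are mine:

```python
BASELINE_15K_IOPS = 180  # assumed IOPS for one 15k drive
DRIVE_COUNT = 9          # R5 8+1

# Total IOPS from the two small-block test profiles in the table.
results = {
    "sequential": 10973,
    "random": 13996,
}


def iops_per_drive(total_iops, drives=DRIVE_COUNT):
    """Spread the measured IOPS evenly across the drives in the group."""
    return total_iops / drives


def comparison_to_15k(total_iops, drives=DRIVE_COUNT, baseline=BASELINE_15K_IOPS):
    """How many times faster each SSD is than the assumed 15k baseline."""
    return iops_per_drive(total_iops, drives) / baseline


for profile, total in results.items():
    print(f"{profile}: {iops_per_drive(total):.2f} IOPS per drive, "
          f"{comparison_to_15k(total):.2f}x a 15k drive")
```

Changing `BASELINE_15K_IOPS` is the easiest way to see how sensitive the multiplier is to the assumed 15k number.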
