Best Practices for Symmetrix Configuration

May 13, 2010 at 10:58 am (Information)

 

Considerations

  • Configure enough resources for your workload
  • Use resources evenly for best overall performance
    • Spread across all available components
    • Includes FE, BE and disks
    • Path management can help FE
    • FAST/Optimizer can help BE

 

Commonly asked questions

  • What size system do I need?
    • Each resource has a limit of I/Os per second and MBs per second
      • Disks
      • Back-end controllers (DAs)
      • Front-end controllers (Fibre, FICON, GigE)
      • SRDF controllers
      • Slices (CPU complexes)
    • Configure enough components to support workload peaks
    • Use those resources as uniformly as possible
    • CPU utilization
      • As a rule of thumb, a limit of no more than 50-70% utilization is good if response time is critical
      • A higher utilization can be tolerated if only IOPS or total throughput matters
    • Memory considerations
      • Ideal to have same size memory boards and same memory between engines
      • Imbalance will make little or no difference with OLTP type workloads
      • Imbalance will create more accesses to boards or engines with large amount of memory, creating a skewed distribution over the hardware resources
    • Front-end connections
      • Go wide before you go deep
        • Use all 0 ports on director first and then the 1 ports
        • Spread across directors first, then on same director
        • Two active ports on one FA slice do not generally do more I/Os
      • Ratios (random read hit normalized at 1)
        • Random read hit 1
        • Random read miss 1/2
        • Random Overwrite I/O’s 1/2
        • Random new write 1/4
      • Worst connection for a host with 8 connections
        • All on one director
        • Instead do one connection per director
    • Disks
      • Performance will scale linearly as you add drives
        • You can see up to 510 IOPS per drive when benchmarking at 8KB, but 150 IOPS is a reasonable design number for real world situations
        • Note that higher IOPS also brings higher response times, because queues will grow
      • Until some back-end director limit is reached
      • With smaller I/O sizes (<32KB), the limit reached is the CPU limit
      • With larger I/O sizes (>32KB), we can reach a throughput limit in the plumbing instead
    • Engine Scaling
      • Scales nearly linearly, though not quite.
      • From 1 to 8 engines, it’s 6.8 to 7.8x WRT IOPS (8KB I/O)
      • From 1 to 8 engines, it’s 4.2 to 7.1x WRT bandwidth (64KB I/O)
      • Scaling from 1 to 8 shows worst numbers. 4 to 8 showed better numbers.
  • What’s the optimum size of a hyper or number per disk?
    • General rule of thumb, fewer larger hypers will give better overall system performance.
      • There is a system overhead to manage a logical volume so it makes sense that more logical volumes could lead to more overhead.
    • Frequently legacy hyper size is carried forward because of migration
    • Virtual Provisioning will make the size of the hyper on the physical disk
      • You can create very large hypers for the TDATs and still present small LUNs to the host
    • There can be a case of having too few hypers per drive
      • Because it could limit concurrency
      • Set a minimum of 4 to 8 hypers
      • Not an issue with large drives or protections other than R1
  • What is the optimum queue depth?
    • Single threaded (or 1 I/O at a time), the I/O rate is simply the inverse of the service time.
      • For a 5.8ms service time your maximum IOPS is 172.
      • Same drive with 128 I/Os queued can get nearly 500 IOPS
    • We need 1-4 I/Os queued to the disk to achieve the maximum throughput with reasonable latencies
      • Lower queue lengths if response time is CRITICAL
    • Higher if total IOPS is more important than response time
    • With VP, the LUN could be spread over 1000s of drives
      • Queue depth of 32 per VP LUN is probably a reasonable start
    • As IOPS go up, response time will exponentially get worse
  • What is the optimum number of members in a meta volume?
    • 255 maximum supported
    • Reasonable sizes for meta member counts are something like 4, 8, 16, 32
    • Even numbers are preferred
      • Powers of 2 fit nicely into back-end configurations
      • Powers of 2 not important for VP thin metas
    • Getting enough I/O into a very large meta can be a problem
      • 32-way R5 7+1 meta volume would need at least 256 I/Os queued to have 1 I/O per physical disk
  • Should I use meta volumes or host-based striping? Or both?
    • Avoid too many levels of striping (plaid)
    • One large meta volume may outperform several smaller meta volumes that are grouped in a host stripe
    • In many cases, host-based striping is preferred over meta volumes
      • One reason is because there will be more host-based queues for concurrency that the host can manage before even getting to the array.
    • However, meta volumes can reduce complexity at the host level
    • So it all depends
    • 24-way meta versus 6 host x 4-way meta – average read response time was better with host-based stripe
  • Striped or Concatenated Metas?
    • In most cases, striped meta volumes will give you better performance than concatenated
      • Because they reside on more spindles
      • Some exceptions exist where concatenated may be better
        • If you don’t have enough drives for all the meta members to be on separate drives (wrapping)
        • If you plan to re-stripe many meta volumes again at the host-level
        • If you are making a very large R5/R6 meta and your workload is largely sequential
      • Concatenated meta volumes can be placed on the same RAID group
      • Don’t place striped meta volumes on the same RAID group (wrapped)
    • Virtual Provisioning
      • Back-end is already striped over the virtual provisioning pool so why re-stripe the thin volume (TDEV)
      • May be performance reasons to have a striped meta on VP
      • Device WP “disconnect” between front-end and backend
        • 5874 Q210SR, 5773 future SR fixes this
      • Number of random read requests we can send to a single device
        • Single device can have 8 outstanding reads per slice per device (TDEV on FA slice)
      • Number of outstanding SRDF/S writes per device
        • Single device can have 1 outstanding write per path per device
      • If it is important to be able to expand a meta, choose concatenated
  • What stripe and I/O size should I choose?
    • For most host-based striping, 128KB or 256KB is good
    • May want to consider a smaller stripe size for database logs, 64KB or smaller may be advised by a Symmetrix performance guru
    • I/O sizes above 64KB or 128KB show little to no performance boost (throughput flattens out). 256KB may actually decrease throughput. This is because everything is managed internally in 64KB chunks.
  • Segregation
    • For the most optimal system performance, you should not segregate applications/BCVs/Clones onto separate physical disks/DAs or engines
    • For the most predictable system performance, you should segregate
    • Tiers should share DA resources so that one tier will not consume resources for another tier
  • What disk drive class should I choose?
    • EFD provide the best response time and maximum IOPS of all drives
    • 15k provide 30% faster performance than 10k (random read miss)
    • 15k provide 56% faster than SATA, 10k provide 39% faster than SATA (random read miss)
    • SATA still does well in sequential read (with single threaded and larger block sizes) (basically good in single stream, bad with multi-thread and therefore disk seeks)
  • What RAID protection should I choose?
    • Performance of reads similar across all protection types (number of drives is what matters)
    • Major difference with random write performance
      • Mirrored: 1 host write = 2 writes
      • R5: 1 host write = 2 reads + 2 writes
      • R6: 1 host write = 3 reads + 3 writes
    • Cost is also a factor
      • R5/R6 are best at 12.5% and 25% protection overhead
      • R1 has 50% protection overhead
  • How much cache do I need?
    • Easiest method is to use the Dynamic Cache Partition What If (DCPwi) tool
      • Put like devices together in cache partitions
      • Start analysis mode and collect DCP stats
  • How do I know when I’m getting close to limits?
    • Watch for growth trends in your workload with SPA
    • Look out for increasing response time (host-based tools like iostat, sar, RMF)
    • Monitor utilization metrics in WLA/STP
    • Better to be proactive than to wait until you hit the wall
    • Any utilizations well over 50% should be considered a possible source of future issues with growth
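
The arithmetic behind a few of the answers above (single-threaded IOPS from service time, the queue needed to put one I/O on every disk of a meta, and back-end I/Os per host write by protection type) can be sketched as follows. This is only an illustration, not an EMC tool: the function names are made up for these notes, the per-write I/O counts are simply the ones listed above, and the back-end estimate assumes every host read is a read miss.

```python
# Illustrative sizing arithmetic; all function names are hypothetical.

# Back-end I/Os generated by one random host write, as listed in the notes.
BACKEND_IOS_PER_WRITE = {
    "mirrored": {"reads": 0, "writes": 2},
    "r5": {"reads": 2, "writes": 2},
    "r6": {"reads": 3, "writes": 3},
}


def single_threaded_iops(service_time_ms):
    """With one I/O outstanding, the rate is the inverse of the service time."""
    return 1000.0 / service_time_ms


def min_queued_ios(meta_members, disks_per_raid_group):
    """I/Os that must be queued to have one I/O per physical disk."""
    return meta_members * disks_per_raid_group


def backend_ios(host_reads, host_writes, protection):
    """Back-end I/Os per second, assuming every host read is a read miss."""
    cost = BACKEND_IOS_PER_WRITE[protection]
    return host_reads + host_writes * (cost["reads"] + cost["writes"])


if __name__ == "__main__":
    # 5.8 ms service time, one I/O at a time
    print(round(single_threaded_iops(5.8)))
    # 32-way meta on R5 7+1 (8 disks per RAID group)
    print(min_queued_ios(32, 8))
    # Same host workload, different protection types
    for protection in BACKEND_IOS_PER_WRITE:
        print(protection, backend_ios(1000, 500, protection))
```

With the numbers from the notes, the first two calls reproduce the 172 IOPS and 256 queued I/Os quoted above. The last loop shows how the same host workload costs more back-end I/Os as the protection moves from mirrored to R5 to R6.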


Performance as a Function of Utilization on CLARiiON

May 12, 2010 at 6:00 pm (Information)

Measurements

  • Utilization = 100% * busy time in period / (idle + busy) time in period
  • Throughput = total number of visitors in period / period length in seconds
  • Average Busy Queue Length (ABQL) = sum of the queue lengths seen by each visitor on arrival / total number of visitors
  • Queue length = ABQL * utilization/100%
  • Response time = queue length / throughput (Little’s Law)
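
A small sketch of how these five measurements chain together, using made-up counter values. The `PollSample` class and its field names are hypothetical (they are not Analyzer's internal names), and the period length is assumed to be busy time plus idle time.

```python
from dataclasses import dataclass


@dataclass
class PollSample:
    """Counters for one object (LUN, disk, SP) over one polling period."""
    busy_s: float              # seconds the object was busy
    idle_s: float              # seconds the object was idle
    visitors: int              # requests that arrived in the period
    queue_on_arrival_sum: int  # sum of queue lengths seen by each arrival


def utilization_pct(s):
    return 100.0 * s.busy_s / (s.idle_s + s.busy_s)


def throughput(s):
    # Assumes the period length is busy time plus idle time.
    return s.visitors / (s.idle_s + s.busy_s)


def abql(s):
    return s.queue_on_arrival_sum / s.visitors


def queue_length(s):
    return abql(s) * utilization_pct(s) / 100.0


def response_time_s(s):
    # Little's Law: response time = queue length / throughput
    return queue_length(s) / throughput(s)


sample = PollSample(busy_s=30, idle_s=30, visitors=6000, queue_on_arrival_sum=24000)
print(utilization_pct(sample), throughput(sample), abql(sample))
print(queue_length(sample), response_time_s(sample) * 1000, "ms")
```

Because response time is derived from utilization rather than measured, anything that distorts the busy counter distorts the result.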

 

For low LUN throughput (<32 IOPS), response time might be inaccurate

  • Response time here is calculated, lazy writes will skew the LUN busy counter
  • RBA actually measures the response time

 

Dual SP ownership of a disk

  • Can also impact response time
  • Each SP only knows about its own ABQL, throughput and utilization for the disk
  • At poll time, they exchange views. The utilization is max(SPA,SPB)
  • ABQL is computed from the sum of the sum
  • And SP throughput is the sum of SPA and SPB throughput

 

Be wary of confusing SP response time in Analyzer with the average response time of all LUNs on that SP

  • Response time is calculated and based on utilization
  • A LUN is busy (not resting) as long as something is queued to it
  • An SP is busy (not resting) as long as it is not in the OS idle loop
  • While a disk is busy getting a LUN request, the LUN is still busy
  • While a disk is busy getting a LUN request, the SP might be idle
  • The SP response time is generally smaller than the average response time of all the LUNs on that SP
  • Host response time is approximated by LUN response time

 

Recall from last year:

  • Rules of Thumb
  • Multiplier (CPUM)
  • CX4-960 – 1.00
  • CX4-480 – 0.65
  • CX4-240 – 0.55
  • CX4-120 – 0.30
  • CX3-80 – 0.50
  • A – CPUM x 50k reads/s standard lun
  • B – CPUM x 16k write/s R5
  • C – CPUM x 20k writes/s R10
  • D – CPUM x 40k reads/s, Snaps, MV/s, clone source
  • E – CPUM x 7.5k writes/s MV/s
  • F – CPUM x 6k writes/s, clone-in-sync
  • G – CPUM x 2.5k writes/s, Snap COFW
  • H- CPUM x 6k writes/s, Snap non-COFW
  • Data logging % = Number of LUNs / Max LUNs * 10%
  • One SP’s utilization will be the sum of the proportional contributions of each I/O type
  • Use 4KB for IOPS and 512KB for Bandwidth
  • I = CPUM x 1500MB/s read
  • J = CPUM x 600MB/s write (cache on)
  • Note: ASAP rebuild, background verify, mirror syncs count against this number
  • Example: CX4-960, RAID 5, 9000 IOPS, 2:1 R:W, 8KB –> 38% utilization
  • 6000 read IOPS, 3000 write IOPS, 48MB/s read, 24MB/s write, RAID 5, CX4-960
  • 6000/50000 + 3000/16000 + 48/1500 + 24/600 = 12% + 19% + 3.2% + 4.0% = 38.2% SP utilization
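
The worked example above can be reproduced with a few lines of code. This is a sketch of the rule of thumb only: it covers terms A, B, C, I and J, ignores the replication terms (D to H) and data logging, and converts IOPS to MB/s using 1000KB per MB as the example does. The function and constant names are made up for these notes.

```python
# CPU multipliers (CPUM) per model, from the rules of thumb above.
CPUM = {
    "CX4-960": 1.00,
    "CX4-480": 0.65,
    "CX4-240": 0.55,
    "CX4-120": 0.30,
    "CX3-80": 0.50,
}

# Limits for one SP at CPUM = 1.00.
READ_IOPS_LIMIT = 50_000                        # A: reads/s, standard LUN
WRITE_IOPS_LIMIT = {"R5": 16_000, "R10": 20_000}  # B and C: writes/s
READ_MBPS_LIMIT = 1_500                         # I
WRITE_MBPS_LIMIT = 600                          # J (cache on)


def sp_utilization_pct(model, raid, read_iops, write_iops, io_kb):
    """Sum the proportional contribution of each I/O type to SP utilization."""
    m = CPUM[model]
    read_mbps = read_iops * io_kb / 1000.0    # 1000KB per MB, as in the example
    write_mbps = write_iops * io_kb / 1000.0
    contributions = [
        read_iops / (m * READ_IOPS_LIMIT),
        write_iops / (m * WRITE_IOPS_LIMIT[raid]),
        read_mbps / (m * READ_MBPS_LIMIT),
        write_mbps / (m * WRITE_MBPS_LIMIT),
    ]
    return 100.0 * sum(contributions)


# CX4-960, RAID 5, 9000 IOPS at 2:1 read:write, 8KB
print(round(sp_utilization_pct("CX4-960", "R5", 6000, 3000, 8)))
# Same workload on a smaller model
print(round(sp_utilization_pct("CX4-480", "R5", 6000, 3000, 8)))
```

Changing only the model shows how quickly the same workload eats into a smaller array's SP headroom.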

 

His formula is too low

  • Configuration polling
    • Pre-FLARE 26.31 configuration polling is another low priority internal function that affects utilization
    • Go to http://ipaddress/setup
    • Choose Set Update Parameters in the Setup menu and set Update Interval to 300s.
    • Performance Interval (for statistics logging) is ok at 60. This does nothing compared to configuration polling and data logging.
    • Also include the -np (no poll) option whenever possible in CLI scripts
  • Data logging
    • 7-10% differential comes from default data logging settings in older FLARE revisions with a lot of LUNs
    • Throughput was still unaffected because Analyzer threads run at a lower priority than I/O threads
    • Navisphere commands could be sluggish because they would be at the same level
    • Fix it by changing from 60/60 or 60/120 to 300/300.
    • Data logging poll rate is the lower of the two.
    • This will significantly reduce pre-FLARE 29 utilization
  • Navisphere operations, especially without -np (no poll)
  • Background verify, rebuild, LUN migration, zeroing operations
  • Snap, Clone, Mirror, SAN Copy overhead
  • Disk or bus bottlenecks
  • Heavy flushing

 

His formula is too high

  • Coalesced backend writes
  • Pre-fetch
  • Nature of the load
     

In FLARE 26.31, FLARE 28, FLARE 29, FLARE 30

  • Delta polling was introduced in FLARE 28 and back-revved to FLARE 26.31
  • Significantly reduces Navisphere overhead
  • FLARE 30, CLI commands without -np are given more processor time
  • FLARE 29, data logging utilization has been reduced 80%
  • FLARE 30 introduces fully provisioned virtual LUNs in pools of storage (thick LUNs)
  • H6099 document
  • NDU now uses % PrivilegedTime not % Processor Time as shown by Analyzer, 65% is safe (instead of 50%).

 

What will happen with SP utilization in the presence of EMC Flash Cache?

  • 64KB is the base element for analysis for migration into Flash Cache
  • There is a considerable amount of promotions (HDD > EFD) that will cost SP utilization. After the bulk of those initial promotions occur, it will be about 8-10% increased SP utilization for Flash Cache after warmup.


VMotion over Distance with VPLEX

May 12, 2010 at 3:57 pm (Information)

 

Vmotion without VPLEX

  • Cannot directly perform Vmotion since storage is not shared
  • Must first perform storage Vmotion

 

Vmotion with VPLEX

  • Enables direct Vmotion between data centers
  • Storage Vmotion is no longer required
  • Replicate the data once then move the VMs at will

 

Use Cases

  • Data Center Load Balancing
    • Optimize resources across several data centers
  • Disaster Avoidance and Data Center Maintenance
    • Evacuate data center ahead of a probable disaster
    • Move application to remote data center to perform maintenance on local data center
  • Zero-downtime Data Center moves
    • Move VMs and data to new data center then decommission old data center

 

Three Basic Configurations

  • Common configurations
    • Maximum supported distance, 100km (with 5ms latency)
    • ESX hosts in both data centers have common IP subnets (stretched layer 2 network)
    • ESX servers can participate in local HA and DRS-enabled clusters
  • VMFS volume built on a VPLEX distributed device
  • VMFS volume is then shared between ESX servers in two locations
  • Scenario 1 (distributed device)
    • Best practice
    • Continuous data protection and transparently protects against storage failures in either location
    • Continuous IO on biased cluster after WAN link failure
    • Continuous IO on biased cluster after non-biased site failure
    • Suspend IO on non-biased cluster after biased site failure
  • Scenario 2 (built on remote device)
    • Not highly available, only good for temporary use when VM must move immediately
  • Scenario 3 (temporary distributed device)
    • Storage Vmotion to a distributed device while in transit to the remote site
    • Then Storage Vmotion back to local storage in the remote site
    • Do this to regain some array functionality that VPLEX might not have
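
The Scenario 1 failure behavior can be written down as a small lookup table. This sketch records only the three outcomes stated in the notes; any other combination is deliberately reported as not covered rather than guessed. The names (`Failure`, `io_behavior`) are made up for illustration and are not VPLEX CLI or API terms.

```python
from enum import Enum


class Failure(Enum):
    WAN_LINK = "WAN link failure"
    BIASED_SITE = "biased site failure"
    NON_BIASED_SITE = "non-biased site failure"


# (failure, role of the surviving cluster) -> behavior, as stated in the notes.
STATED_OUTCOMES = {
    (Failure.WAN_LINK, "biased"): "continue I/O",
    (Failure.NON_BIASED_SITE, "biased"): "continue I/O",
    (Failure.BIASED_SITE, "non-biased"): "suspend I/O",
}


def io_behavior(failure, cluster_role):
    """Look up what a cluster does to a distributed device after a failure."""
    return STATED_OUTCOMES.get((failure, cluster_role), "not covered in these notes")


for failure in Failure:
    for role in ("biased", "non-biased"):
        print(f"{failure.value:25} {role:11} -> {io_behavior(failure, role)}")
```

The table makes the point of the rule-set advice concrete: which cluster holds the bias decides where I/O keeps running.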

 

Failure Cases

  • N+1 configuration handles director failures transparently
  • Any WAN or remote cluster failure while Vmotion is in progress simply results in Vmotion being aborted

 

Rule-set Best Practices

  • Manage your rule-sets very carefully
    • Be aware of which cluster will win in the event of a failure
  • Place related VMs on the same data store so that they will move together
  • For any given data store, move all VMs at the same time
  • For the most critical applications, dedicate a data store to the VM


Converged Data Center: FCoE, iSCSI, and the Future of Storage Networking

May 12, 2010 at 2:53 pm (Information)

 

Stuart Miniman, Office of the CTO

 

The Journey to Convergence

The iSCSI Story

  • Transport SCSI over standard Ethernet
  • Reliability through TCP
  • SCSI has limited distance, iSCSI extended the distance

 

Non-Ethernet Convergence Options

  • Infiniband
    • Used broadly for High Performance Computing (HPC) environment
    • Low cost and ultra-low latency geared for server to server cluster
    • Separate use from general network (Ethernet) or storage (FC or Ethernet)
  • PCIe
    • Extension of the server bus to an I/O aggregation box
    • Not a standard, small players
    • Still using Ethernet and FC network and storage from an aggregation box

 

Maturation of 10Gb Ethernet

  • Allows replacement of n x 1Gb with a much smaller number of 10Gb adapters
  • Single network allows for easier mobility for virtualization/cloud deployments
  • Simplifies server, network and storage infrastructure

 

Standards

  • 40Gb and 100Gb Ethernet (IEEE) standards will be completed in June 2010
  • 16Gb FC (T11) standard is targeted for completion at the end of 2010
  • 32Gb is in the works
  • Server Adoption of FC ~3+ years, of Ethernet ~5+ years

 

Protocols and Standards

Fibre Channel over Ethernet

  • Developed by T11, International Committee for Information Technology Standards (INCITS) T11 Fibre channel Interfaces Technical Committee
  • FC-BB-5 standard ratified in June 2009

 

Converged Enhanced Ethernet

  • Developed by IEEE Data Center Bridging (DCB) Task Group
  • Commonly referred to as Lossless Ethernet
  • IEEE standards targeting ratification mid-2010

 

iSCSI and FCoE Framing

  • iSCSI is SCSI functionality transported using TCP/IP for delivery and routing
  • FCoE is FC frames encapsulated in Layer 2 Ethernet frames over Lossless Ethernet

 

FCoE Frame Formats

  • Ethernet frames give a 1:1 encapsulation of FC frames, no segmenting FC frames across multiple Ethernet frames
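
One consequence of the 1:1 encapsulation is that the Ethernet frame must be large enough to hold a full-size FC frame. The sketch below adds up the field sizes. These byte counts are not from the session notes: they are the FC and FCoE frame layouts as I understand them, so treat them as assumptions and check them against the standards before relying on them.

```python
# Field sizes in bytes (assumed, not from the session notes).
ETH_HEADER = 14        # destination MAC (6) + source MAC (6) + EtherType (2)
VLAN_TAG = 4           # 802.1Q tag, which carries the priority bits
FCOE_HEADER = 14       # version, reserved bits and the encoded SOF
FC_HEADER = 24
FC_MAX_PAYLOAD = 2112
FC_CRC = 4
FCOE_TRAILER = 4       # encoded EOF plus reserved bytes
ETH_FCS = 4

STANDARD_MAX_TAGGED_FRAME = 1522  # classic Ethernet frame with a VLAN tag


def fcoe_frame_bytes(fc_payload):
    """Size of the Ethernet frame that carries one FC frame, unsegmented."""
    if not 0 <= fc_payload <= FC_MAX_PAYLOAD:
        raise ValueError("FC payload must be between 0 and 2112 bytes")
    return (ETH_HEADER + VLAN_TAG + FCOE_HEADER + FC_HEADER
            + fc_payload + FC_CRC + FCOE_TRAILER + ETH_FCS)


largest = fcoe_frame_bytes(FC_MAX_PAYLOAD)
print(largest, "bytes for a full-size FC frame")
print("fits in a standard Ethernet frame:", largest <= STANDARD_MAX_TAGGED_FRAME)
```

Under these assumptions a full-size FC frame does not fit in a standard Ethernet frame, which is why FCoE links are normally configured for larger (roughly 2.5KB) frames instead of segmenting.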

 

FC-BB-6

  • Next step
  • Not required for multi-hop FCoE or other current deployments
  • Likely to support point-to-point configuration which allows two FCoE devices to communicate without going through an FCF (or switch)
  • Today initiator cannot talk directly to a target without a switch in between. FC-BB-6 is investigating this.

 

Lossless Ethernet

  • IEEE 802.1 Data Center Bridging (DCB) is the standards task group
  • Converged Enhanced Ethernet (CEE) is the industry consensus term
  • Link level enhancements (Priority Flow Control, Enhanced Transmission Selection, Data Center Bridging Exchange Protocol) are shipping in products today
    • PAUSE and Priority Flow Control
      • Classic 802.3x PAUSE is rarely implemented since it stops all traffic
      • New PAUSE known as PFC that can halt traffic according to priority tag while allowing traffic at other priority levels to continue. This creates lossless virtual lanes.
    • Enhanced Transmission Selection
      • Maintain low latency treatment of certain traffic classes
    • Data Center Bridging Exchange Protocol
      • Auto-negotiation for devices as they determine the link parameters
  • The CEE cloud or DCB-enabled LAN is only for the portion of your network that requires lossless Ethernet
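
The difference between classic PAUSE and Priority Flow Control is easy to see in a toy model of a sending port with eight priority queues. This is a simplified sketch, not switch firmware: the `Link` class is made up for illustration, and priority 3 for FCoE is an assumption (a commonly used value), not something stated in the session.

```python
from collections import deque

FCOE_PRIORITY = 3  # assumption: a commonly used value, not from the notes


class Link:
    """Toy model of a sender with one queue per 802.1p priority (0 to 7)."""

    def __init__(self):
        self.queues = {priority: deque() for priority in range(8)}
        self.paused = set()

    def enqueue(self, priority, frame):
        self.queues[priority].append(frame)

    def pause(self, priority=None):
        """Pause one priority (PFC) or, with None, all of them (802.3x PAUSE)."""
        self.paused.update(range(8) if priority is None else [priority])

    def resume(self, priority):
        self.paused.discard(priority)

    def transmit(self):
        """Send at most one frame from each priority that is not paused."""
        sent = []
        for priority, queue in self.queues.items():
            if priority not in self.paused and queue:
                sent.append(queue.popleft())
        return sent


link = Link()
link.enqueue(0, "lan-frame")
link.enqueue(FCOE_PRIORITY, "fcoe-frame")

link.pause(FCOE_PRIORITY)      # receiver asks the sender to hold FCoE only
first = link.transmit()        # LAN traffic keeps moving
link.resume(FCOE_PRIORITY)
second = link.transmit()       # the held FCoE frame is sent, not dropped
print(first, second)
```

The paused frame stays queued rather than being discarded, which is what makes the lane lossless. Calling `pause()` with no argument models the classic behavior where everything on the link stops.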

 

Beyond Link Level

  • End-to-end
  • Congestion notification
    • IEEE 802.1Qau ratified
    • Allows a switch to notify attached ports to slow down transmission due to heavy traffic
  • Layer 2 multipathing
    • IETF TRILL – Transparent Interconnection of Lots of Links
    • Used with STP to provide more efficient bridging and bandwidth aggregation
    • Focuses on bridging capability that will increase bandwidth by allowing and aggregating multiple network paths

 

Solution Evolution

iSCSI

  • iSCSI was >15% of revenue ($1.8B in 2009) and >20% capacity in SAN market in 2009
  • 10Gb iSCSI solutions are available
  • iSCSI natively routable (IP)
  • iSCSI solutions are much smaller scale than FC

 

FCoE

  • FCoE with direct attach of server to Converged Network Switch at top of rack or end of row
  • Tightly controlled in 2009
  • First solutions are with FCoE-aware Ethernet switch (FIP snooping)
    • Blade > FCoE switch > FC switch > Storage

 

Rack Area Network (RAN)

  • Cisco UCS
  • Vblock
  • VCE (virtual computing environment)

 

Timeline

  • End of year, you will see FCoE native in CX and DMX

 


Unisphere Hands-on

May 12, 2010 at 10:57 am (Information)

 

Overview

  • Supports all previous features of Navisphere and Celerra Manager
  • Unisphere can manage Clariions running FLARE 19 and above
  • Unisphere can manage Celerra systems with DART 6.0
  • CLI still separate and unchanged for CX and Celerra
  • Off-array packages also available for Unisphere
  • Slated for the July 2010 timeframe

 

Experience Notes:

  • The reporting was snappy.
  • When I went to create a hot spare (RG), it was a little bit slow. Looks like it has to query the latest information on the RG IDs and available drives.
  • Thick pool LUNs will be released in FLARE 30. They will be like thin pool LUNs but will be fully provisioned. So basically they’re going with the 3PAR provisioning method. Note that it appears that pool devices default to thick creation.
  • Expansion today is done with traditional metaLUN expansion. FLARE 30 will include expansion of pool-based LUNs, both thick and thin. After performing the expansion, the LUN list needed to refresh and the speed seemed similar to Navisphere (probably because this is the same naviseccli java call to the CX).
  • Pool best practices, adding drives to a pool should be multiples of 8. Not enforced but recommended.
  • System > System Information > SPA/B Tasks > Generate Diagnostic Files (this generates SP collects)
  • System > System Information > SPA/B Tasks > Get Diagnostic Files (download SP collects)

 

Recommendations:

  • Tiering policy – FAST, is “auto-tier” the default? Can you make “no data movement” the default?
  • It would be nice if things were auto-selected based on your selection in the GUI. So if I click Pool 9 and then click Create LUN, those options should be selected in the Create LUN pop-up window.
  • Progress bar in Create LUN would be nice if it gave more of a description of what was happening while you’re waiting.
  • Sorting in the Storage Assignment Wizard didn’t work on any of the columns. It works in other areas.
  • In the Storage Assignment Wizard, it would be nice if you could double-click the entry and have it check the box as well.
  • Filtering with just a number doesn’t seem to work properly. Entering just “5” or “6” didn’t filter as expected.

 

Feedback

  • Liked the “Results of the Storage Assignment Wizard” text feedback of what was completed.
  • The LUN view and immediate/snappy access to Analyzer was nice.
  • Access to quick reporting and filtering is awesome.
  • System > Hardware and the capability of clicking a component and it then being highlighted is very nice.
     

 

Highlights from the GUI

System Information

  • Manage Data Ports – view Clariion SP ports (Fibre/iSCSI, WWN/IQN)
  • Fault Status Report
  • Storage System Connectivity Status – traditional Connectivity Status
  • Power Savings – enables spin down if “Enable Storage System Power Savings” is checked off, it will also tell you what RAID groups are eligible for it

 

Service Tasks

  • These all require Unisphere Service Manager to be installed
  • Replace Faulted Disk
  • Update CLARiiON software
  • Register Storage System
  • View Advisories
  • Launch USM

 

Hosts > Host Management

  • Hypervisor Information Configuration Wizard – helps you setup integration of VMs from current ESX servers. Wizard steps are:
    1. Select the storage system whose LUNs are in use by the ESX servers
    2. Select management mode in use with your ESX servers
    3. Add/edit/remove vCenter Server managed ESX servers (if applicable)
    4. Add/remove ESX servers not managed by vCenter server (if applicable)
    5. Apply settings to update your Unisphere virtual server environment
  • Failover wizard – helps guide you through the process of setting up failover software (PowerPath)

 

Replicas

  • Snapshots
  • Clones
  • Mirrors
  • SAN Copy Sessions
  • RLP, Write Intent Log
  • Can launch RM from here (need to have RM installed)

 

Monitoring

  • Fault Status Report
  • Reports Wizard
  • High Availability Verification Wizard
  • Trespassed LUNs Status

 

Celerra

  • Unisphere is basically a portal to the old management web pages from Celerra Manager. A lot of the links pop up the old web pages.
  • There are some new pages that are ported into Unisphere


VPLEX Birds-of-a-Feather

May 12, 2010 at 10:46 am (Information)

 

GA was April 15th, 2010.

No snapshotting features today

 

Front-end Limitations

  • 400 initiators, 200 hosts – this is for 1/2/4 engine VPLEX
  • This is the supported number today, plans are to have it increase

 

There are futures to add asynchronous WAN capabilities into the product in order to extend the distance, but not in the product at GA.

 

You can bring in volumes and take out volumes with little disruption. Need to understand this process

It can encapsulate 3rd party vendor storage and adds VPLEX functionality to those devices

 

SCSI2/3 reservations are supported

 

Deployment options

  • Encapsulation – take host offline, re-present to VPLEX (no meta data is changed), and re-present to host
  • Didn’t catch the other one, had to do with LVM

 

Top use cases

  • Extraction of an application from one data center to another
  • Ability to react quickly to movement requirements
  • Simplicity of provisioning
  • Resiliency (ability to lose an entire array)

 

Metro Connectivity

  • Native out of the box today is fibre channel, so you will need either DWDM/CWDM or FCIP conversion
  • Cisco Nexus OTV will play nicely with VPLEX

 

Licensing

  • Sold on a capacity basis, first 10TB is included
  • Tier-based license (curve to it)
  • Base is VPLEX Local
  • VPLEX Metro only covers the capacity that you want to federate; the rest is covered by Local
    • You will need that Metro license at EACH site. This is in preparation for when it goes to multi-site Metro.
  • Offering a subscription service (1 year for an amount of capacity), can be renewed (turns CapEx into OpEx)
  • Suppose you have 100TB site A, 50TB site B, and then 20TB that will federate between both sites.
    • 100TB Local license + 50TB Local license
    • 2x 20TB Metro license

 

If I have VPLEX, do I still need SRDF?

  • Barry Burke calls this the hand grenade question
  • VPLEX will do synchronous mirroring
  • VPLEX is for the requirement of active/active access with mirroring
  • SRDF can still be used for BC purposes

 

http://www.emc.com/campaign/global/vplex/index.htm

http://thestorageanarchist.typepad.com/weblog/2010/05/3003-to-boldly-go.html


Flash Architecture

May 11, 2010 at 4:14 pm (Information)

 

Flash drives are like little storage systems

  • Memory buffer
    • Buffers hold index of all locations
    • Buffers incoming writes
    • Buffer resiliency
      • Power capacitors maintain power to the buffer in the event of system power failure
      • Contents are then written to the persistent store if power fails
  • Pages
    • Cells are addressed by pages
    • 73GB and 200GB use 4KB pages
    • 400GB use 16KB pages
    • Page contents are contiguous address space, like SP cache pages
    • Two 2KB IO in a 4KB flash page but must be contiguous WRT LBA
  • Blocks
    • NAND storage is mapped like a filesystem
    • Pages are grouped together into blocks
      • Not to be confused with SCSI or filesystem blocks
      • Multiple pages in a block jumbled together
      • Addresses of pages in a block do not have to be contiguous
    • Writes to NAND are done at block level
    • Block images are held in buffer until the block is full, then written to previously erased block on disk
    • There must be an erased block available for the write
  • Channels
    • Paths to physical devices (chips)
    • Flash drives have multiple channels, discrete devices can be read from or written to simultaneously
    • Large I/O is striped across the channels

 

Page States

  • Flash as Mapped Device
    • Workload can affect page state
    • Page state can affect availability of blocks
    • Availability of free (erased) blocks determines write performance
  • Valid state: contains good data (referenced by host and flash)
  • Invalid state: contains stale data
  • Erased state: block is not in use
  • Pages become randomized due to random writes
  • Valid or invalid if referenced by flash meta data
  • For example
    • A file that occupied two blocks on the chip gets written to
    • The first block gets written to the buffer and the block in the NAND gets marked as invalid

 

Reserve Capacity

  • Some percentage of capacity is reserved and not included as user addressable capacity
  • The capacity will be used to provide ready blocks for incoming writes
  • Sustained heavy writes can saturate a Flash drive
  • Now the drive will need to perform erase operations in idle cycles

 

Erasing Blocks

  • The drive will erase blocks during idle periods
  • To be erased, a block must have all invalid pages
    • Every valid page in a block must first be written to another block
    • That requires additional activity
      • Read in pages to buffered block
      • Erase old locations in NAND
      • Write out consolidated block to NAND
      • Basically this is defragmentation (housekeeping)
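
The housekeeping steps above can be sketched with a block modeled as a list of page states. This is a deliberately tiny model (four pages per block, no buffer, no wear leveling), and the names `PageState` and `reclaim_block` are invented for illustration. Real drive firmware is far more involved.

```python
from enum import Enum


class PageState(Enum):
    ERASED = "erased"
    VALID = "valid"
    INVALID = "invalid"


def reclaim_block(victim, spare):
    """Copy valid pages from victim into an erased spare, then erase victim.

    Returns the number of page copies, which is the extra work the drive
    had to do before the victim block could be erased.
    """
    if any(state is not PageState.ERASED for state in spare):
        raise ValueError("spare block must be erased before it can be written")
    copied = 0
    for state in victim:
        if state is PageState.VALID:
            spare[copied] = PageState.VALID  # consolidate valid pages together
            copied += 1
    victim[:] = [PageState.ERASED] * len(victim)  # whole block erased at once
    return copied


V, I, E = PageState.VALID, PageState.INVALID, PageState.ERASED
victim = [V, I, I, V]
spare = [E, E, E, E]
copies = reclaim_block(victim, spare)
print(copies, [s.value for s in victim], [s.value for s in spare])
```

The more valid pages a victim block holds, the more copies are needed to free it, which is the link between capacity utilization and housekeeping time.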

 

Consolidation

  • Do flash drives slow down over time?
  • Free space is a factor but so is time because it gets more and more fragmented over time
  • Total capacity utilization can affect the response time of sustained writes
    • Higher capacity utilization results in more valid pages in each block
    • Over time, distribution of valid pages becomes more random and capacity utilization increases
    • If blocks have a high percentage of valid pages, it is more difficult to consolidate and erase a block
    • The drive therefore needs more time to do housekeeping

 

Issues

  • A >20% random write workload can have a pretty significant effect on flash drive performance

 

Backfill

  • Small writes and backfill, aka write amplification
  • Writing an I/O smaller than the page requires read-modify-write
  • This therefore doubles the workload on the drive
  • This makes 73GB and 200GB flash drives better as they use 4KB page sizes and don’t suffer from this penalty as much as 400GB drives do with 16KB page sizes
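
A short sketch of the backfill penalty, comparing the two page sizes mentioned above. It assumes each write starts on a page boundary and counts only page-level operations. The function name is made up for illustration.

```python
def page_operations(io_kb, page_kb):
    """Return (page_reads, page_writes) for one page-aligned host write.

    A write that leaves a page partly filled needs a read-modify-write:
    the page is read, merged with the new data, and written back.
    """
    full_pages, remainder_kb = divmod(io_kb, page_kb)
    partial = 1 if remainder_kb else 0
    return partial, full_pages + partial


def bytes_written_ratio(io_kb, page_kb):
    """KB written to flash per KB written by the host."""
    _, page_writes = page_operations(io_kb, page_kb)
    return page_writes * page_kb / io_kb


for io_kb in (2, 4, 8, 16):
    for page_kb in (4, 16):
        reads, writes = page_operations(io_kb, page_kb)
        print(f"{io_kb:>2}KB write, {page_kb:>2}KB page: "
              f"{reads} read + {writes} write, "
              f"{bytes_written_ratio(io_kb, page_kb):.1f}x bytes written")
```

Under these assumptions, 4KB and 8KB writes avoid the read-modify-write on a 4KB-page drive but still incur it on a 16KB-page drive.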

 

Flash and Write Cache

  • Original guidance: flash does not need SP cache
  • New guidance: flash can help SP cache in many cases
  • Experience: many uses of flash + SP cache in the field
  • OK to use the SP cache for flash drives now and is a benefit in many cases

 

Best Practices

  • Best use
    • High random read rates
    • Smaller I/O
    • I/O patterns that are not optimal for cached FC implementations
  • Databases: 4-15 flash drives typical
    • Indexes and busy tables
      • Biggest disk-for-disk increase in read-heavy tables (10-20x)
    • Temp space
      • But turn on SP write cache because of the write/re-read/write nature of temp databases
    • Some clients using Flash for write-heavy loads
      • Use SP cache for better response time
      • Flash flushes cache faster
  • Really big databases are a little different
    • Up to 30 flash drives
    • These bypass SP write cache to maximize write throughput
  • Oracle ASM 11gR2
    • Users can differentiate groups as FAST, AVERAGE, SLOW
  • Messaging (Exchange, Notes)
    • Database to flash and all users benefit
    • Use R5 for Exchange on flash
      • Turn on SP write cache
      • Writes flush to R5 on flash faster than R10 on FC
      • Reads are likely better distributed than from R10 on flash
      • Flash rebuilds faster than FC and impact is less
  • Ok to use
    • Databases
      • Oracle Flash Recovery: SATA does fine here and is more economical
      • Redo logs: FC is sufficient and costs less
      • Archive logs: FC or even SATA does fine
    • Media
      • Editing configurations are the best fit for flash in media
      • Some advantage to multi-stream access
      • FC will give more predictable write performance at a micro level due to flash’s internal structure
    • Any time power/cooling is an issue

 

 

 


Architecture Deep Dive for VPLEX

May 11, 2010 at 2:41 pm (Information)

 

What is it?

  • Storage federation platform that extends storage beyond the boundaries of the data center
  • Distributed, peer-to-peer, fault-tolerant storage system

 

What problem does it solve?

  • Simultaneous access to storage devices from two distinct locations
  • Planned data mobility within, across and between data centers

 

What’s unique?

  • Clusters can scale out and scale up
  • Distributed coherent cache enabling distributed storage access
  • Designed for asynchronous distances and multiple clusters

 

VPLEX Director

  • 8x 8Gb/s FC front-end
  • 8x 8Gb/s backend
  • Dual quad-core processors (2.33GHz)
  • 32GB memory, about 25GB for cache
  • 4x 8Gb/s FC com ports (intra- and inter-cluster)
  • 4x 1Gb/s Ethernet com ports (inter-cluster)
  • Standard EMC hardware
  • GeoSynchrony is the operating system
  • Looks like a VMAX engine

 

VPLEX Cluster

  • Rack containing 1, 2, or 4 engines
  • Plus a management server (and standby)
  • Inter-director Fibre Channel switches
  • Power and battery backup for each engine
  • Cabling for management and intra-cluster com
  • VPLEX Local is one cluster
  • VPLEX Metro is two clusters

 

Major Subsystems

  • Front-end – storage view
  • Cache – distributed coherent cache
  • Device Virtualization – virtual volumes
  • Back-end – storage volumes

 

Lease Rollover

  • Can be used to migrate data from one array to a new array
  • Import new storage array/volumes
  • Production volumes are mirrored
  • Original array can be pulled out while the virtual devices now point to the new array

 

Application Migration

  • Enable application at the remote site
  • Cache activity now functions at source and remote site
  • Disable application at source site
  • Cache activity migrates to the remote site

 

Cluster failure and cluster partition

  • Cluster failure and partition is hard
    • What does A know and when does it notice it can’t talk to B any more?
    • Easy if B’s really dead (then it can’t mess us up), but what if it’s still alive?
    • How do A and B decide what to do?
  • Cluster Bias
    • Each distributed volume has a bias that favors one site over another
  • What happens to all the writes to the other side?

 

A Closer Look at Distributed Mirrors

  • Each distributed mirror has two dirty region logs (DRLs)
    • One records writes that couldn’t be committed locally
    • One records writes that couldn’t be committed to the remote site
  • Each distributed mirror also has a bias and timeout for applying that bias

 

Failure of cluster without bias

  1. IO goes to cluster A and B
  2. Something bad happens to B. A can’t talk to B and thus starts a timer. IO pauses briefly
  3. The timer expires, IO resumes for hosts at A. Writes are logged to the DRL.
  4. Eventually B is repaired and brought back online. A and B reconnect.
  5. A begins to resynchronize B based on the DRL. Distributed mirror and cache coherency present A’s version of the data
  6. Resynchronization finishes, DRLs are now empty

 

Failure of cluster with bias

  1. IO goes to cluster A and B
  2. B can’t talk to A. IO pauses indefinitely.
  3. Admin instructs B to resume IO, and starts passive application at B. Writes are logged in B’s DRL
  4. A is repaired and brought back online. A and B reconnect. A notices B continued in spite of bias settings.
  5. B begins to resynchronize A based on the DRL. Distributed mirror and cache coherency present B’s version of the data
  6. Resynchronization finishes, DRLs are now empty
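The two sequences above can be mimicked with a toy model. This is only a sketch of the idea, not how GeoSynchrony implements it: the class `DistributedMirror`, its methods, and the `admin_override` flag are hypothetical names, the bias timer is not modeled, and a region is just a dictionary key.

```python
class DistributedMirror:
    """Toy model of a two-site mirror with a bias and dirty region logs."""

    def __init__(self, bias: str = "A"):
        self.legs = {"A": {}, "B": {}}        # region -> data at each site
        self.drl = {"A": set(), "B": set()}   # regions the peer has not received
        self.bias = bias
        self.connected = True

    def write(self, site: str, region: int, data: str,
              admin_override: bool = False) -> bool:
        """Return True if the write is accepted, False if IO is suspended."""
        if self.connected:
            for leg in self.legs.values():
                leg[region] = data
            return True
        if site != self.bias and not admin_override:
            return False                      # non-bias site waits for an admin
        self.legs[site][region] = data
        self.drl[site].add(region)            # remember what the peer missed
        return True

    def reconnect_and_resync(self, winner: str) -> int:
        """Copy the winner's dirty regions to the other site and empty its DRL."""
        loser = "B" if winner == "A" else "A"
        copied = 0
        for region in sorted(self.drl[winner]):
            self.legs[loser][region] = self.legs[winner][region]
            copied += 1
        self.drl[winner].clear()
        self.connected = True
        return copied


if __name__ == "__main__":
    mirror = DistributedMirror(bias="A")
    mirror.write("A", 0, "v1")
    mirror.connected = False                  # partition: A keeps going on bias
    mirror.write("A", 0, "v2")
    print("regions resynchronized:", mirror.reconnect_and_resync("A"))
    print("legs match:", mirror.legs["A"] == mirror.legs["B"])
```

Only the logged regions are copied on reconnect, which is the point of keeping a DRL rather than resynchronizing the whole volume.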


Transitioning to Auto-Provisioning Groups

May 11, 2010 at 11:54 am (Information)

 

Auto-provisioning groups are intended to replace the traditional symmask functionality and definitely simplify the provisioning process.

 

3 Grouping Objects

  • Storage groups: symaccess -sid 1234 create -type storage -name mystorage -devs 050:053
  • Port groups: symaccess -sid 1234 create -type port -name myports -dirport 7e:0,8e:0
  • Initiator groups: symaccess -sid 1234 create -type initiator -name myinit -file initiator_list

 

Must define a masking view

  • Associates devices in the storage group with the port group and the initiator group
  • Uses dynamic lun addressing, same lun address for all paths
  • symaccess -sid 1234 create view -name dbserver -storgrp mystorage -portgrp myports -initgrp myinit
  • Simplifies updates/changes
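To show how the three groups and the view fit together, here is a small sketch that only assembles the command strings from these notes; it does not run them. The function name `build_provisioning_commands` is hypothetical, the flags are copied from the examples above, and the port group line is assumed to take the same `create` verb as the others. Check the exact syntax against your Solutions Enabler version.

```python
def build_provisioning_commands(sid, devs, dirports, initiator_file,
                                storage="mystorage", ports="myports",
                                initiators="myinit", view="dbserver"):
    """Return the symaccess command lines for three groups and one view.

    Nothing is executed; the strings mirror the examples in these notes.
    """
    base = f"symaccess -sid {sid} create"
    return [
        f"{base} -type storage -name {storage} -devs {devs}",
        f"{base} -type port -name {ports} -dirport {','.join(dirports)}",
        f"{base} -type initiator -name {initiators} -file {initiator_file}",
        f"{base} view -name {view} -storgrp {storage} "
        f"-portgrp {ports} -initgrp {initiators}",
    ]


if __name__ == "__main__":
    for line in build_provisioning_commands(
            "1234", "050:053", ["7e:0", "8e:0"], "initiator_list"):
        print(line)
```

The view is built last because it only references the three group names, which is also why later changes are simpler: you edit a group and every view using it follows.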

 

AccessLogix Database

  • ACLX gatekeeper device (instead of the VCM database)

 

Limited symmask Compatibility Mode

  • Both the VCM and ACLX databases contain masking entries, reside in SFS, and are accessed through a gatekeeper device
  • However, data structures are completely different
  • VCM database entries are based on one-to-one relationship between initiators and front-end ports
  • ACLX database entries are based on many-to-many relationship between groups of initiators and ports
  • Compatibility mode accepts the traditional VCM symmask commands but converts them into the auto-provisioning groups above, with one entry in each group
  • Restrictions: once you start using symmask, you have to continue using symmask. If you use symaccess, symmask will no longer work because symmask is looking for one-to-one relationship
  • Best practice is to transition over to auto-provisioning groups

 

Storage Templates

  • Define capacity, protection, drive type, and other criteria
  • Specify template rather than specific devices when initially provisioning storage or adding new capacity
  • If the devices do not exist, it can optionally create them


Virtual Provisioning Best Practices with Symmetrix VMAX

May 11, 2010 at 9:02 am (Information)

 

VP simplifies drive and DA workload distribution but doesn’t distribute performance across FAs.

 

Pool Count, Fewer Pools for Easier Management

  • Segregation by Application
    • Separate applications that require consistent performance
    • Avoids unexpected disk queuing if critical apps have dedicated disks
  • Segregation by Use
    • Separate database tables and log
    • Separate backup media, such as clones (allows more space efficient protection, i.e. R5/R6)

 

Pool Configuration, Best Practices for TDATs

  • Equality for all TDATs (data devices)
    • Place on drives that have same rotational speed
    • Use common size to avoid uneven data distribution
    • Disk hyper count (TDAT count * RAID member count) should be multiple of disk count to spread load evenly within a pool
  • Fewer, larger devices
    • Fewer objects to manage
    • 8 disk hypers per physical disk at a minimum for adequate disk queue depth (too few could cause starvation of I/O requests to the disk)
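Both rules of thumb above can be checked with a few lines of arithmetic. This is a sketch under the assumption that only the pool's own hypers are counted on each disk. The function name `check_pool_layout` and the example numbers are made up for illustration.

```python
def check_pool_layout(tdat_count: int, raid_members: int, disk_count: int,
                      min_hypers_per_disk: int = 8) -> dict:
    """Check the hyper-count guidance for a thin pool.

    disk hypers = TDAT count * RAID member count (e.g. 4 for R5 3+1).
    The total should divide evenly across the disks, and each disk
    should carry at least min_hypers_per_disk hypers.
    """
    if min(tdat_count, raid_members, disk_count) <= 0:
        raise ValueError("all counts must be positive")
    disk_hypers = tdat_count * raid_members
    per_disk, leftover = divmod(disk_hypers, disk_count)
    return {
        "disk_hypers": disk_hypers,
        "hypers_per_disk": per_disk,
        "evenly_spread": leftover == 0,
        "enough_queue_depth": per_disk >= min_hypers_per_disk,
    }


if __name__ == "__main__":
    # Hypothetical pool: 240 R5 3+1 TDATs (4 members each) on 120 disks.
    print(check_pool_layout(240, 4, 120))
```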

 

Considerations with Availability

  • Definitions
    • Mean Time to Data Loss (MTTDL)
    • Mean Time Between Failures (MTBF)
    • Number of Drives (N)
    • Mean Time to Replace a failed drive (MTTR)
  • Mirrored and RAID5
    • MTTDL = (MTBF of the disk)^2 / [N * (N-1) * MTTR]
  • RAID6
    • Data loss from simultaneous dual drive failures is effectively zero, since RAID6 tolerates two failed drives
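The mirrored/RAID5 formula is easy to plug numbers into. The sketch below implements MTBF^2 / (N * (N-1) * MTTR) as given in these notes. The function name `mttdl_hours` and the example MTBF and MTTR values are assumptions for illustration, not vendor figures.

```python
def mttdl_hours(mtbf_hours: float, n_drives: int, mttr_hours: float) -> float:
    """Mean Time to Data Loss for mirrored or RAID5 protection.

    MTTDL = MTBF^2 / (N * (N - 1) * MTTR), with all times in hours.
    """
    if n_drives < 2:
        raise ValueError("need at least two drives")
    if mtbf_hours <= 0 or mttr_hours <= 0:
        raise ValueError("MTBF and MTTR must be positive")
    return mtbf_hours ** 2 / (n_drives * (n_drives - 1) * mttr_hours)


if __name__ == "__main__":
    HOURS_PER_YEAR = 24 * 365
    # Hypothetical inputs: 1,000,000 hour MTBF, 24 hour replacement time.
    for label, n in (("mirrored pair", 2), ("R5 3+1", 4), ("R5 7+1", 8)):
        years = mttdl_hours(1_000_000, n, 24) / HOURS_PER_YEAR
        print(f"{label}: about {years:,.0f} years")
```

Because N * (N-1) sits in the denominator, wider RAID groups and slower drive replacement both shorten the MTTDL.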

 

Considerations with Performance

  • However, R6 will have performance implications with respect to write requirements
    • Mirrored = 1x (2 operations)
    • R5 = 2x (4 operations: read old data, update data, read parity, write parity)
    • R6 = 3x (6 operations: read old data, update data, read first parity, write first parity, read second parity, write second parity)
  • The basic unit is 12 tracks, which go to a single TDAT. Sequential writes work very well with R5 3+1 because the full stripe is 12 tracks.
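The operation counts above translate into back-end load as follows. This is a worst-case sketch: it assumes every host read is a cache miss costing one disk operation and that writes are not coalesced in cache or written as full stripes. The names `BACKEND_OPS_PER_WRITE` and `backend_iops` are hypothetical.

```python
# Disk operations per host write, taken from the list above.
BACKEND_OPS_PER_WRITE = {"mirrored": 2, "raid5": 4, "raid6": 6}


def backend_iops(host_iops: float, write_fraction: float,
                 protection: str) -> float:
    """Estimate disk operations per second for a host workload.

    Reads are counted as one disk operation each (all misses); writes
    are multiplied by the per-protection operation count.
    """
    if not 0.0 <= write_fraction <= 1.0:
        raise ValueError("write_fraction must be between 0 and 1")
    ops_per_write = BACKEND_OPS_PER_WRITE[protection]
    reads = host_iops * (1.0 - write_fraction)
    writes = host_iops * write_fraction
    return reads + writes * ops_per_write


if __name__ == "__main__":
    # Hypothetical workload: 1,000 host IOPS, 25% writes.
    for protection in BACKEND_OPS_PER_WRITE:
        print(protection, backend_iops(1000, 0.25, protection))
```

Dividing the result by a per-drive design number gives a first estimate of how many drives the pool needs for that workload.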

 

TDEVs

  • Devices that are presented to the host that are over-subscribed
  • Allocate extent on first write, performance implication

 

First Write

  • When the TDEV is bound, you have an option for one of three cases
  • VTOC is done on a track by track basis
  • Case 1: Unallocated
    • Allocate extent
    • VTOC track, pad if necessary (up to 64KB) (if it’s only 8KB, it will have to pad the rest)
    • 5-7ms latency in the beginning
  • Case 2: Pre-allocated
    • VTOC track, pad if necessary (up to 64KB)
    • 3-3.5ms latency in the beginning
  • Case 3: Pre-written
    • Clear to write!
    • 2ms latency in the beginning
  • All three cases even out over time, depending on usage

 

Stripe for Concurrency

  • Thin pool 480 R5 3+1 data devices on 480 drives, 4 engine VMAX, upwards of 165,000 IOPS with 500 thin devices
  • Increase thread count
  • Increase TDEV
  • Increase FA slices/ports/targets

 

Metas

  • Conceptually, striped and concatenated meta devices should perform equally
  • Reality is there can be differences due to effective I/O queue depth
    • Random read was equal for 8KB (out-performs non-meta though)
    • Random write: striped is slightly higher but still close (far out-performs non-meta)
    • Sequential write: striped begins to out-perform significantly as queue depths grow toward 64
  • Striping gives the highest performance
  • However, if meta expansion will be a requirement, then concatenated will be optimal
  • Alternatively, do the striping at the host level
  • Tune your applications to Symmetrix architecture
    • 8-16KB random request size
    • 64-256KB sequential request size (Symmetrix track aligned)
    • 256KB is the largest the Symmetrix will send to a disk
