Drive Density Kills RAID

While it lowers the cost of storing data, drive density kills RAID as a means of protecting against media failure. Today, both flash and hard disk media technologies are available in ~16TB capacities. These high-capacity devices expose various weaknesses in traditional RAID algorithms, making RAID suspect today and impractical in the near future.

On our next Whiteboard Wednesday, we’ll discuss RAID vs. Erasure Coding and dive deep into why drive density kills RAID. You can sign up for the interactive virtual whiteboard session here. In this column, we’ll provide an overview of the RAID problems with drive density. Please come back for the whiteboard session to learn more about RAID and Erasure Coding’s pros and cons.

Drive Density Kills RAID Hot Sparing

The good thing about RAID is that it continues to provide access to data when a drive fails. The challenge, however, is quickly getting the storage system back into a protected state before another drive fails. The majority of enterprise storage systems use global hot spares to rebuild the RAID group automatically. Hot spares are dedicated drives that sit idle, waiting for a failure. When a drive fails, a hot spare is logically assigned to the affected RAID group, so the customer doesn’t have to do anything to start the rebuild process.

There are three fundamental problems with hot sparing. First, the hot spare needs to be replaced relatively quickly after the first failure; otherwise, the system is running without a hot spare. Replacing hot spares is particularly problematic during the current pandemic because access to data centers is no longer routine.

Second, because traditional RAID can’t mix drive sizes within RAID groups, the customer needs to have a hot spare for each type of drive in the storage system. The inability to mix drive sizes also means that new drives must match the capacity of the existing ones when a customer wants to expand their storage system. Alternatively, they can create a new RAID group out of higher capacity drives and manually migrate data to the new RAID group. Using hot spares is also inefficient. Each hot spare sets aside what can now be 16TB of capacity, waiting for something to go wrong. It is not uncommon for customers to have 50 or 100TB of idle capacity in hot spares. In most cases, that idle capacity cancels out any capacity gains the customer gets from using deduplication or compression.
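To put a number on the idle capacity, here is a minimal sketch of the arithmetic. The drive sizes and spare counts are hypothetical and chosen only for illustration; they do not describe any particular customer system.

```python
def idle_spare_capacity_tb(spares):
    """Sum the capacity parked in hot spares.

    `spares` is a list of (drive_size_tb, spare_count) pairs, one per drive type.
    """
    return sum(size_tb * count for size_tb, count in spares)


# Hypothetical system with three drive types and two hot spares for each
example_spares = [(16.0, 2), (12.0, 2), (7.68, 2)]
idle_tb = idle_spare_capacity_tb(example_spares)
print(f"Idle hot spare capacity: {idle_tb:.2f} TB")
```

Because each drive type needs its own spares, every new drive size added to the system adds another entry to this list.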

The third problem is that hot spares contribute to slow RAID rebuilds, which we address next.

Drive Density Kills RAID Because of Slow Rebuilds

One of the biggest challenges facing data centers is the time it takes to rebuild a RAID group. Even all-flash arrays are seeing rebuild times creep up as capacities increase. But the biggest challenge is the rebuild time on hard disk drives. A RAID group built from 12 or 16TB hard drives typically measures rebuild times in days, not hours. Many potential customers who come to us looking for rebuild time relief report RAID rebuild times of WEEKS for these high capacity drives. Slow rebuild times cause many problems, some obvious, others not so obvious.

The most obvious and pressing problem with slow rebuilds is the time it takes to complete the rebuild. If you have your system set to a single parity drive, you cross your fingers the entire time the rebuild is occurring, hoping that another drive doesn’t fail.

Why are RAID rebuilds so slow?

Part of the problem points back to the hot spare concept. When that hot spare is logically moved into position, all the other drives in the group feed reconstructed data to that one drive simultaneously, so the rebuild can go no faster than a single drive can write. Also, most RAID implementations have to do a sector-by-sector rebuild of the drive, so even if that 16TB drive had only 5TB of data on it, it doesn’t rebuild any quicker than a drive that has 15.5TB of data on it. As flash and hard drive capacities continue to increase, the problem continues to worsen.
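A rough back-of-the-envelope sketch shows why capacity drives rebuild time. It assumes the rebuild is limited by the sustained write rate of the single replacement drive and that the whole drive is rewritten regardless of how much data it held. The throughput figures are assumptions for illustration, not measurements.

```python
def rebuild_hours(drive_tb: float, write_mb_per_s: float) -> float:
    """Hours needed to rewrite an entire drive at a sustained write rate."""
    total_mb = drive_tb * 1_000_000  # decimal terabytes to megabytes
    return total_mb / write_mb_per_s / 3600


# The amount of data stored on the failed drive does not appear in the formula:
# a sector-by-sector rebuild always rewrites the full capacity.
idle_array_hours = rebuild_hours(16, 200)  # assumed rate with no production load
busy_array_hours = rebuild_hours(16, 30)   # assumed rate while serving production IO

print(f"Idle array: {idle_array_hours:.1f} hours")
print(f"Busy array: {busy_array_hours / 24:.1f} days")
```

Doubling the drive capacity doubles the result, which is why the problem tracks drive density so closely.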

All this simultaneous writing, while still managing other tasks, puts a heavy burden on the drives. Consequently, the chances of another failure increase dramatically. With a single parity design and high-density drives, rebuilds can take hours or days and the concerns of a double drive failure are legitimate.
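The exposure during a rebuild can be sketched with a simple model. It assumes drives fail independently at a constant annualized failure rate (AFR), which, given the rebuild stress described above, likely understates the real risk. The 2% AFR and the 12-drive group are hypothetical values.

```python
import math


def second_failure_probability(surviving_drives: int, afr: float, rebuild_hours: float) -> float:
    """Chance that at least one surviving drive fails during the rebuild window.

    Assumes independent failures at a constant rate derived from the AFR.
    """
    rate_per_hour = -math.log(1.0 - afr) / 8760  # 8760 hours in a year
    return 1.0 - math.exp(-surviving_drives * rate_per_hour * rebuild_hours)


# Hypothetical 12-drive group: 11 drives survive the first failure
one_day = second_failure_probability(11, 0.02, 24)
two_weeks = second_failure_probability(11, 0.02, 24 * 14)

print(f"1-day rebuild:  {one_day:.4%}")
print(f"2-week rebuild: {two_weeks:.4%}")
```

Under these assumptions the two-week window is roughly fourteen times riskier than the one-day window, because the exposure grows almost linearly with rebuild time.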

The only solution that RAID-dependent storage vendors offer customers is dual parity RAID. At the expense of capacity efficiency and write IO performance, these RAID groups can survive two drive failures. The challenge is that as capacities and RAID rebuild times continue to increase, the likelihood of multiple simultaneous drive failures also increases. Triple parity RAID designs remain rare on the market today, and even where triple parity is available, the impact on write performance can be devastating.
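The cost of each additional parity drive can be sketched as follows. The model uses the textbook read-modify-write pattern for small writes, in which the data block and every parity block are each read and then rewritten. Real arrays use caching and full-stripe writes to soften this, so treat the numbers as a simplified illustration. The 12-drive group size is an assumption.

```python
def usable_fraction(drives: int, parity: int) -> float:
    """Share of raw capacity left for data in a parity RAID group."""
    return (drives - parity) / drives


def small_write_ios(parity: int) -> int:
    """Back-end IOs per small write: read and rewrite the data block and each parity block."""
    return 2 * (1 + parity)


GROUP_SIZE = 12  # hypothetical RAID group size
for parity in (1, 2, 3):
    share = usable_fraction(GROUP_SIZE, parity)
    ios = small_write_ios(parity)
    print(f"{parity} parity: {share:.1%} usable, {ios} IOs per small write")
```

Each added parity drive removes one drive's worth of usable capacity and adds two back-end IOs to every small write.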

The Answer is vRAID

The answer to the challenges of increasing drive density is vRAID. When we founded StorONE, we had two choices for protecting customers from media failure: RAID or Erasure Coding. We rejected RAID for the reasons described above. Erasure Coding, however, was ideal for reasons I will explain during our Whiteboard Wednesday session, but using it would impact our software’s high-performance IO capabilities.

Our answer to the high-density challenge? Rewrite Erasure Coding from scratch, removing its inefficiencies and optimizing it for high-performance storage. Every performance benchmark you’ll see from us is on volumes with vRAID active. With vRAID, you can set redundancies on a per-volume level, mix drive sizes within the RAID group, and experience the fastest rebuilds in the industry, no matter the media type.

To learn more about StorONE’s data protection suite, check out “Invincible Storage” or watch our Lightboard video “Driving Down Storage Costs with Better Primary Storage,” and of course, make sure to register for our Whiteboard Wednesday session “RAID vs. Erasure Coding.”

George Crump

George has over 25 years of experience in the storage industry, holding executive sales and engineering positions. Before joining StorONE, he was the founder and lead analyst at Storage Switzerland.
