
Revealed: Is SSD “Reliability” Really Reliable?

SSDs have become the preferred solution for more and more enterprises due to their high read and write performance and low latency, and play an important role in databases, virtualization, application acceleration, big data, cloud computing and even artificial intelligence. Enterprise SSDs often need to operate in demanding environments with high concurrency, high stress, and 24/7 operation, and their reliability is one of the key concerns for enterprise users.

Reliability refers to the ability of a component or system to continue performing its intended function for a specified period of time under specified operating conditions. For enterprise-class SSDs, it is a very important metric that not only directly determines core indicators such as product shipment yields and failure rates, but also plays a key role in the protection of data availability and consistency.

01. Reliability quantifier: MTBF

SSD reliability is usually quantified by MTBF (Mean Time Between Failures): the ratio of cumulative operating time to the number of failures over the product's total period of use. It expresses reliability in terms of time: the fewer failures a product has, the higher its MTBF and the more reliable it is.

Enterprise SSDs face tougher reliability requirements than consumer SSDs. According to the OCP (Open Compute Project), enterprise SSDs deployed in data centers should have an MTBF of 2,000,000 hours, which is the current standard for enterprise SSDs. MTBF, however, has to be verified by actual run-time testing; it cannot be conjured out of thin air. Running a product for 2 million hours (roughly 228 years), let alone repeating that several times, is clearly impossible with conventional methods. So how is a 2-million-hour MTBF obtained?

The answer is statistical extrapolation: a sufficiently large sample is tested for a fixed period under acceleration factors (such as accelerated write volume or elevated operating temperature). The process simulates a typical user scenario, verifies the theoretical value through actual testing, and qualifies product quality before shipment. How rigorous this run-time verification is directly determines whether the MTBF “reliability index” is itself reliable.

02. MTBF characterization period

Like most electronic products, SSDs follow the bathtub curve (failure rate curve), which is divided into three critical periods:

  • Infant Mortality

When a product is first manufactured and powered up for use, factors such as yield rate can cause it to fail at a high rate. To ensure that the SSDs delivered to customers meet enterprise reliability standards, enterprise SSD manufacturers will run aging tests on all products in the production line for a certain length of time to maximize the exposure of possible early failures and ensure that customers do not have early failure problems with their products.

  • Random Failures or Normal Life

This phase corresponds to the official shipping product, where the product failure rate is lower and more stable. This is the period described by the MTBF, i.e., the stable use phase of the product.

  • Wear Out

In this final phase the failure rate rises again as components age and wear accumulates; for an SSD, this is when the NAND approaches the end of its rated endurance.

03. MTBF = MTTF?

In addition to MTBF, you may have heard another reliability term: MTTF. For a maintainable device, MTBF = MTTF + MTTR, and the three are related as follows:

  • MTTF (Mean Time To Failure):

The average time the system runs normally before it fails, taken over all periods T1 between the start of normal operation and the occurrence of a failure. $\mathrm{MTTF} = \sum T_1 / N$;

  • MTTR (Mean Time To Repair):

The average length of the period between the occurrence of a failure and the end of the repair. $\mathrm{MTTR} = \sum (T_2 + T_3) / N$;

  • MTBF (Mean Time Between Failures):

The average length of the period between one system failure and the next, including the repair time. $\mathrm{MTBF} = \sum (T_2 + T_3 + T_1) / N$.

Because MTTR is usually much smaller than MTTF, MTBF is approximately equal to MTTF.
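The relationship between the three averages can be checked with a few lines of Python. This is only an illustrative sketch: the function name `reliability_means` and the failure log are hypothetical, and each repair interval is assumed to correspond to T2 + T3 in the formulas above.

```python
def reliability_means(uptimes_h, repair_times_h):
    """Return (MTTF, MTTR, MTBF) in hours from per-failure intervals.

    uptimes_h      -- hours of normal operation before each failure (T1)
    repair_times_h -- hours from each failure to the end of repair (T2 + T3)
    """
    if not uptimes_h or len(uptimes_h) != len(repair_times_h):
        raise ValueError("need one repair interval per operating interval")
    n = len(uptimes_h)
    mttf = sum(uptimes_h) / n
    mttr = sum(repair_times_h) / n
    # MTBF averages the full failure-to-failure cycle: operation plus repair.
    return mttf, mttr, mttf + mttr


# Hypothetical log with three failures (values invented for illustration).
uptimes = [9_000.0, 11_000.0, 10_000.0]
repairs = [4.0, 6.0, 5.0]
mttf, mttr, mtbf = reliability_means(uptimes, repairs)
print(f"MTTF={mttf} h, MTTR={mttr} h, MTBF={mtbf} h")
```

With repair times this small relative to the operating periods, MTBF and MTTF differ by only a few hours, which is why the two are treated as interchangeable.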

04. The theoretical MTTF formula: where do 2,000,000 hours come from?

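In the simplest case, MTTF is calculated with the following formula, where the sum runs over all SSDs under test:

$$\mathrm{MTTF} = \frac{2 \sum_i A_i t_i}{\chi^2(\alpha,\ 2 n_f + 2)}$$

1. $A_i$ is the acceleration factor of SSD $i$;
2. $t_i$ is the test time of SSD $i$;
3. $n_f$ is the number of failed SSDs;
4. $\alpha$ is the confidence limit (60%);
5. $\chi^2$ is the chi-squared distribution, evaluated at confidence $\alpha$ with $2 n_f + 2$ degrees of freedom.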

The acceleration factors in the above equation are usually divided into 3 categories:

  1. Unaccelerated factor: A = 1, usually used for firmware failures;
  2. TBW (Total Bytes Written) acceleration factor: accelerates wear-out by increasing the data write intensity;
  3. Temperature acceleration factor: accelerates the occurrence of faults by raising the temperature of the test environment.

TBW (Total Bytes Written) acceleration factor

TBW is the unit of SSD endurance. For example, a PBlaze6 SSD with 3.84 TB of user capacity and an endurance rating of 1.5 DWPD writes 5.76 TB per day, so its total write volume over 5 years of field deployment is 10.5 PB. Increasing the daily write volume (accelerated write stress) consumes the SSD's life faster and therefore brings failures forward.

As an example of the TBW acceleration factor, assume an SSD with 100 GB of user capacity whose product specification defines its life as 175 TBW over 5 years (43,800 hours) in a typical usage scenario. If it writes 130 TB of data in 1008 hours with a write amplification of 1.2, its write rate (130 TB / 1008 h) is about 32 times the rated rate (175 TB / 43,800 h), so the TBW acceleration factor is 32. Writing more data in the same period raises the TBW acceleration factor accordingly.
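The arithmetic behind both figures can be reproduced with a short Python sketch. The helper names are hypothetical, and the acceleration factor is assumed here to be the plain ratio of the test write rate to the rated write rate, with the same write amplification (1.2) on both sides so that it cancels; the article's original formula may treat write amplification differently.

```python
def lifetime_writes_tb(dwpd, capacity_tb, years):
    """Total host writes (TB) implied by a DWPD rating over the warranty period."""
    return dwpd * capacity_tb * 365 * years


def tbw_acceleration_factor(test_written_tb, test_hours, rated_tbw_tb, rated_hours):
    """Ratio of the write rate during the test to the rated write rate.

    Assumption: write amplification is identical in both workloads and cancels.
    """
    test_rate = test_written_tb / test_hours      # TB per hour under test
    rated_rate = rated_tbw_tb / rated_hours       # TB per hour in the spec
    return test_rate / rated_rate


# 1.5 DWPD on a 3.84 TB drive over 5 years.
total_tb = lifetime_writes_tb(1.5, 3.84, 5)
print(f"5-year writes: {total_tb / 1000:.1f} PB")

# 130 TB written in 1008 h against a 175 TBW / 43,800 h rating.
factor = tbw_acceleration_factor(130, 1008, 175, 43_800)
print(f"TBW acceleration factor: {factor:.1f}")
```

Under these assumptions the factor comes out slightly above 32, in line with the value quoted in the text.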


Temperature acceleration factor

Because of its inherent characteristics, NAND retains data for a shorter time as the temperature rises. According to the Arrhenius equation, 1 year (8760 hours) of SSD operation at a room temperature of 40°C is equivalent to 52 hours in an aging chamber at 85°C.

JESD 22-A108 defines the effect of temperature on SSDs over time. The HTOL (High-Temperature Operating Life) test determines the reliability of SSD operation under prolonged high-temperature conditions. Unless otherwise specified, the protocol requires SSDs to be tested at a junction temperature stress of 125°C. However, enterprise-class SSDs are generally designed with high-temperature protection logic to prevent high temperatures from degrading NAND data retention and damaging components, so the actual operating temperature of an SSD will not reach 125°C.

The temperature acceleration factor is calculated as follows, with both temperatures expressed in kelvin:

$$A_T = \exp\left[\frac{E_a}{k}\left(\frac{1}{T_1} - \frac{1}{T_2}\right)\right]$$

1. $E_a$ is the activation energy of the failure model, generally 0.7 eV;
2. $k$ is the Boltzmann constant, 8.617 × 10⁻⁵ eV/K;
3. $T_1$ is the working temperature (standard value is 55°C, or 328 K);
4. $T_2$ is the test acceleration temperature.
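A minimal Python sketch of the Arrhenius temperature acceleration factor, using the symbols defined above. The function name is hypothetical. The second example is an observation rather than something the article states: the "1 year at 40°C is about 52 hours at 85°C" figure is reproduced when an activation energy of 1.1 eV is assumed instead of 0.7 eV.

```python
import math

BOLTZMANN_EV_PER_K = 8.617e-5  # Boltzmann constant k in eV/K


def temperature_acceleration_factor(use_temp_c, stress_temp_c, activation_energy_ev=0.7):
    """Arrhenius acceleration factor between a use and a stress temperature."""
    t_use_k = use_temp_c + 273.15        # T1 in kelvin
    t_stress_k = stress_temp_c + 273.15  # T2 in kelvin
    exponent = (activation_energy_ev / BOLTZMANN_EV_PER_K) * (1 / t_use_k - 1 / t_stress_k)
    return math.exp(exponent)


# Working temperature 55°C, chamber at 85°C, Ea = 0.7 eV as in the list above.
af_operating = temperature_acceleration_factor(55, 85)
print(f"55°C -> 85°C, Ea=0.7 eV: x{af_operating:.2f}")

# Assumption: Ea = 1.1 eV (not given in the article) for the retention example.
af_retention = temperature_acceleration_factor(40, 85, activation_energy_ev=1.1)
equivalent_hours = 8760 / af_retention
print(f"1 year at 40°C ~ {equivalent_hours:.0f} h at 85°C (Ea=1.1 eV)")
```

Because the factor depends exponentially on the activation energy, the choice of Ea for a given failure mechanism has a large effect on the extrapolated MTBF.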

MTTF calculation examples

Assuming a sample size of 400, a test time of 1008 hours, an acceleration factor Ai = A(TBW) × A(T) of 10, zero failures, and a confidence level of 60%, then MTTF = MTBF = 4,400,000 hours.
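This example can be reproduced in Python with the standard library alone. The sketch below assumes the usual chi-squared form of the estimate, MTTF = 2·ΣAᵢtᵢ / χ²(α, 2n_f + 2), and that every drive shares the same acceleration factor and test time; the helper names are hypothetical. The chi-squared quantile is found by bisection on the closed-form CDF, which is available because the degrees of freedom are always even.

```python
import math


def chi2_cdf_even_df(x, df):
    """Chi-squared CDF for an even number of degrees of freedom (closed form)."""
    k = df // 2
    half = x / 2.0
    term = 1.0
    total = 1.0
    for j in range(1, k):
        term *= half / j          # (x/2)^j / j!
        total += term
    return 1.0 - math.exp(-half) * total


def chi2_quantile_even_df(p, df):
    """Invert the CDF by bisection; adequate for this illustration."""
    lo, hi = 0.0, 1.0
    while chi2_cdf_even_df(hi, df) < p:
        hi *= 2.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if chi2_cdf_even_df(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def mttf_estimate(accelerated_device_hours, failures, confidence=0.60):
    """MTTF from total accelerated device-hours (sum of A_i * t_i)."""
    df = 2 * failures + 2
    return 2.0 * accelerated_device_hours / chi2_quantile_even_df(confidence, df)


# 400 drives x 1008 hours x acceleration factor 10, zero failures, 60% confidence.
device_hours = 400 * 1008 * 10
mttf = mttf_estimate(device_hours, failures=0)
print(f"MTTF ~ {mttf:,.0f} hours")
```

The result rounds to the 4,400,000 hours quoted above. Passing a non-zero `failures` value shows how quickly each observed failure lowers the demonstrated MTTF.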

Note that MTBF is strictly temperature-dependent. This is also mentioned in the OCP Datacenter NVMe® SSD Specification:

  • MTBF 2,500,000 hours (AFR ≤ 0.35%), corresponding to SSD operating temperature of 0°C~50°C;
  • MTBF 2,000,000 hours (AFR ≤ 0.44%), corresponding to SSD operating temperature of 0°C~55°C.

But there is always a gap between theory and reality. In a product-level MTBF test it is difficult to reach an acceleration factor of 10. The TBW acceleration factor only exercises the endurance of the NAND flash, while a real test must also cover the reliability of hardware such as the circuitry and physical interfaces, which can only be accelerated by temperature. In practice, demonstrating an MTBF of 2 million hours requires at least 2,000 samples running under the acceleration factors for a full 1,000 hours or more.

05. What is the relationship between MTBF and AFR?

In addition to MTBF indicators, there are other quantitative reliability indicators, such as Failure Rate (λ) and Annualized Failure Rate (AFR), where AFR and MTBF can be transformed into each other.

  • Failure rate λ: When selecting key components for SSD, it is necessary to ensure that the failure rate λ of each component meets the standard. The MTBF definition is more straightforward than the failure rate metric and is more applicable to express system-level reliability;
  • AFR: Annualized Failure Rate, which gives a better idea of the chances of a drive failure in any given year.

The MTBF and AFR conversion equations are as follows:

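  • $\mathrm{MTBF}_{\text{hours}} = 1 / \lambda_{\text{hours}}$
  • $\mathrm{MTBF}_{\text{years}} = 1 / (\lambda_{\text{hours}} \times 24 \times 365)$
  • $\mathrm{AFR} = 365 \times 24 \times \lambda_{\text{hours}} = 8760 / \mathrm{MTBF}_{\text{hours}}$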
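The conversions above are simple enough to check in a few lines of Python. The function names are hypothetical, and the 5-year cumulative figure is assumed to be a plain linear multiple of the AFR, which is only a reasonable approximation while failure rates are this low.

```python
HOURS_PER_YEAR = 24 * 365  # 8760


def afr_from_mtbf(mtbf_hours):
    """Annualized failure rate as a fraction (0.0044 means 0.44%)."""
    return HOURS_PER_YEAR / mtbf_hours


def mtbf_from_afr(afr):
    """Inverse conversion: MTBF in hours from an AFR given as a fraction."""
    return HOURS_PER_YEAR / afr


for mtbf_hours in (2_000_000, 2_500_000):
    afr = afr_from_mtbf(mtbf_hours)
    five_year = 5 * afr  # assumption: linear accumulation over a 5-year warranty
    print(f"MTBF {mtbf_hours:,} h -> AFR {afr:.2%}, 5-year cumulative {five_year:.2%}")
```

For the two OCP operating points this yields AFR values of 0.44% and 0.35%, matching the specification figures quoted earlier.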

The values of MTBF and AFR correspond to the following:

Enterprise SSD product reliability MTBF ≥ 2,000,000 hours (@55°C), which translates to an annualized failure rate of AFR ≤ 0.44%, corresponding to an FFR (Functional Failure Requirement, the accumulated functional failure rate of the SSD over the entire wear life time frame, with a 5-year warranty as a reference) ≤ 2.2%.


06. Validation of MTBF (Memblaze’s own test platform, the Whale system)

In the field of data reliability, Memblaze has developed its own MemSolid technology suite to ensure the consistency and reliability of enterprise-class data. Through full-path data protection, LDPC soft-decision decoding for error correction, cross-channel backup protection of metadata, an inter-die dynamic RAID5 recovery mechanism for bad-block data, and read-over and over-temperature protection, PBlaze provides sustained data consistency protection so that business-critical enterprise data always sits in a safe and reliable storage environment.

To ensure that shipped SSDs meet MTBF standards, Memblaze has drawn on more than ten years of experience in the SSD field and its understanding of users’ practical applications to develop its own MTBF test platform, the Whale system.

It is built with reference to JEDEC standards and is suitable for R&D (DVT), Environmental Stress Test (EST), data retention, production (aging and ORT, Ongoing Reliability Testing), and RDT testing of PCIe SSDs. The Whale system is preloaded with test cases that closely match real customer use scenarios and applies reasonable acceleration factors in long-running tests on RDT-stage products, providing quality assurance before mass production.

According to Memblaze’s shipment and actual failure rate statistics, the actual cumulative product failure rate (CFR, Cumulative Failure Rate) of PBlaze series SSDs is much lower than the nominal annualized failure rate.

After more than a decade of deep work in the SSD industry, Memblaze has built a rigorous design and quality control system that spans chips, software, hardware, production, and shipping. This system ensures that PBlaze series enterprise SSDs provide customers with excellent reliability while greatly reducing their operating expenses (OPEX) and total cost of ownership (TCO). Memblaze will continue to refine its products in the spirit of craftsmanship to live up to customers’ expectations.

References:

  1. Memblaze SSD Reliability: MTTF/MTBF/AFR/CFR/RDT
  2. JESD218, Solid-State Drive (SSD) Requirements and Endurance Test Method
  3. JESD 22-A108, Temperature, Bias, and Operating Life
  4. OCP Datacenter NVMe® SSD Specification Version 2.0
  5. Calculating Reliability using FIT & MTTF: Arrhenius HTOL Model
