
Factors affecting SSD reliability, working principles, and solutions


Flash memory is no longer a new technology. Still, every technology takes a long time to move from its first appearance to widespread use: it has to compete with existing technology and earn users' trust through practice. Flash memory is a typical example. Users are happy with its high performance, but they also worry about its reliability, erase-cycle limits, and failure rate. This article works through those questions one by one to ease those doubts.

Why do flash media have erasure life limits?


The basic unit that holds data in flash media is called a Cell. Each Cell records data by having electrons injected into it and released from it. Every time electrons move in or out, the Cell wears a little. As wear accumulates, electrons become more likely to escape from the Cell, and the stored data can change (a bit flip, or "jump"). For example, a Cell might be written with the binary value 10 and, some time later, be read back as 11.

Because stored data can change in this way, flash memory must always be used together with an error-correcting code (ECC).

  • When data is written, the ECC engine calculates the redundant data based on the original data and saves both the original and redundant data.
  • When data is read, the original data and redundant data are read together, errors are checked and corrected by the ECC engine, and the correct original data is finally obtained.

Write: Original Data → ECC engine → Original Data + Redundant Data

Read: Original Data + Redundant Data (some data may have jumped) → ECC engine → Correct Original Data
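
The write and read paths above can be sketched with a very small error-correcting code. This is only an illustration: it uses a Hamming(7,4) code, which corrects a single flipped bit in a 7-bit word, as a stand-in for the much stronger codes an SSD controller actually uses. The function names are hypothetical.

```python
def hamming74_encode(data_bits):
    """Write path: compute 3 redundant bits for 4 data bits (illustrative only)."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # parity over codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over codeword positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # parity over codeword positions 4, 5, 6, 7
    # Stored word = original data + redundant data, positions 1..7.
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(stored):
    """Read path: check the stored word, correct one flipped bit, return the data.

    Returns (data_bits, error_position); error_position is 0 when no error was found.
    """
    c = list(stored)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    error_position = s1 + 2 * s2 + 4 * s3  # 1-based position of the flipped bit
    if error_position:
        c[error_position - 1] ^= 1  # flip the bad bit back
    return [c[2], c[4], c[5], c[6]], error_position


data = [1, 0, 1, 1]
stored = hamming74_encode(data)
stored[4] ^= 1  # simulate one bit jumping while the data sits in flash
recovered, position = hamming74_decode(stored)
print(recovered, position)  # prints [1, 0, 1, 1] 5
```

If two bits flip in the same word, this toy code corrects to the wrong data, which mirrors the point made next: once errors exceed what the ECC engine can handle, the data is lost.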

The number of bit flips in stored data grows with the number of erase cycles. Once the erase count reaches a certain threshold, the flips become too numerous for the ECC engine to correct, and the data can no longer be read. This threshold is the maximum erase count of the flash.

Why can LDPC improve the erasable life of flash memory?


As described above, data stored in flash memory can flip after it has been held for a while, and the number of flips grows with the erase count. That is why every SSD needs an internal ECC engine to detect and correct data errors.

In the SSD field, the standard ECC algorithm has been BCH (named after the initials of its three inventors, Bose, Chaudhuri, and Hocquenghem), which meets the majority of SSD error correction requirements. The maximum erase counts quoted for flash media are based on BCH.

However, with the widespread adoption of TLC media and 3D NAND, the error rate of a data block near the end of its life rises sharply, and the error correction capability of BCH is too weak to keep up. This is where the LDPC error correction algorithm becomes useful in SSDs.

| | BCH | LDPC |
|---|---|---|
| Read delay | Low | Low |
| Decoding delay | FT – Short | FT – Short |
| Controller complexity | Medium | High |
| ECC (error correction capability) | Low | High |

LDPC is an algorithm with strong error correction capability (it can correct more bit flips than BCH) at the cost of high complexity. It was first applied in the communications industry.

In 1963, Gallager's paper “Low-Density Parity-Check Codes” described the idea behind the algorithm in detail, and LDPC was born. LDPC coding has since been widely used in optical communication, satellite communication, and other communication fields, and can fairly be called a very mature error correction algorithm.
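
To give a feel for what "low-density parity-check" means, here is a toy sketch, not the code an SSD controller runs. The 4×6 matrix `H`, the sample codeword, and the simple hard-decision bit-flipping decoder are all assumptions chosen for readability. Real LDPC codes use far larger, sparser matrices and typically more sophisticated decoders.

```python
# Each row of H is one parity check; a 1 marks a bit that takes part in it.
# Every bit appears in only two checks, which is the "low-density" idea.
H = [
    [1, 1, 0, 1, 0, 0],
    [0, 1, 1, 0, 1, 0],
    [1, 0, 0, 0, 1, 1],
    [0, 0, 1, 1, 0, 1],
]


def syndrome(word):
    """Return one bit per check: 0 if the check passes, 1 if it fails."""
    return [sum(h * b for h, b in zip(row, word)) % 2 for row in H]


def bit_flip_decode(word, max_iterations=10):
    """Repeatedly flip the bit involved in the most failed checks.

    Returns the corrected word, or None if the checks still fail after
    max_iterations (too many errors for this toy decoder).
    """
    word = list(word)
    for _ in range(max_iterations):
        failed = syndrome(word)
        if not any(failed):
            return word
        votes = [
            sum(failed[r] for r in range(len(H)) if H[r][col])
            for col in range(len(word))
        ]
        word[votes.index(max(votes))] ^= 1
    return None


codeword = [0, 0, 1, 0, 1, 1]  # satisfies all four checks
damaged = list(codeword)
damaged[0] ^= 1  # one bit jumps
print(syndrome(damaged))         # prints [1, 0, 1, 0]
print(bit_flip_decode(damaged))  # prints [0, 0, 1, 0, 1, 1]
```

The iterative decoding loop is also a hint at why the controller complexity of LDPC is higher than that of BCH.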

How is SSD performance reflected?


The performance of a storage system is mainly reflected in two indicators: IOPS, the number of I/Os processed per second, and latency, the time the storage system takes to complete an I/O after receiving it.

| | SSD | HDD (15K RPM) |
|---|---|---|
| IOPS | >10,000 | 220 |
| Latency | 0.2 ms | 5 ms |

The above table compares the performance of SSDs and traditional HDDs. SSDs are much better than HDDs in terms of both IOPS and latency.

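In terms of IOPS, a single SSD delivers what would otherwise take many high-performance HDDs working together (more than 10,000 versus about 220 per disk in the table above). The low latency of an SSD, however, cannot be matched by HDDs no matter how many are combined.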
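
As a rough check on the figures in the table, the sketch below relates the two indicators using Little's law (throughput = outstanding requests / time per request). The queue depth of 1 is an assumption, not something the table states, and the function name is hypothetical.

```python
import math


def iops_at_queue_depth(latency_ms, queue_depth=1):
    """Little's law: I/Os per second = outstanding I/Os / seconds per I/O."""
    return queue_depth * 1000.0 / latency_ms


# Figures taken from the table; ">10000" is treated as exactly 10,000.
SSD_IOPS, HDD_IOPS = 10_000, 220
SSD_LATENCY_MS, HDD_LATENCY_MS = 0.2, 5.0

hdds_to_match_one_ssd = math.ceil(SSD_IOPS / HDD_IOPS)
print(hdds_to_match_one_ssd)                # prints 46
print(iops_at_queue_depth(HDD_LATENCY_MS))  # about 200 IOPS at queue depth 1
print(iops_at_queue_depth(SSD_LATENCY_MS))  # about 5,000 IOPS at queue depth 1
```

With one outstanding I/O, 5 ms per I/O works out to roughly 200 IOPS, close to the 220 listed for the HDD. At 0.2 ms the same calculation gives roughly 5,000 IOPS, so a figure above 10,000 suggests the SSD is serving at least two I/Os in parallel.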

Is it true that SSDs break after being written thousands of times?


No, this is not true. Each time the SSD handles a write, it places the data at a new physical address, so that wear is spread evenly across all of the physical flash space (a technique known as wear leveling).
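
The idea of redirecting every write to a fresh physical location can be sketched with a toy mapping layer. Everything here is a simplifying assumption: the class name `ToyFTL` is hypothetical, one logical address occupies one whole block, and the policy is simply "use the least-worn free block". Real firmware is far more involved.

```python
class ToyFTL:
    """Toy flash translation layer: each write goes to the least-worn free block."""

    def __init__(self, physical_blocks):
        self.erase_counts = [0] * physical_blocks
        self.free = set(range(physical_blocks))
        self.mapping = {}  # logical address -> physical block

    def write(self, logical_address):
        # Choose the free block with the fewest erases (ties broken by index).
        target = min(self.free, key=lambda b: (self.erase_counts[b], b))
        self.free.remove(target)
        old = self.mapping.get(logical_address)
        self.mapping[logical_address] = target
        if old is not None:
            # The stale copy is erased and returned to the free pool.
            self.erase_counts[old] += 1
            self.free.add(old)
        return target


ftl = ToyFTL(physical_blocks=8)
for _ in range(800):
    ftl.write(logical_address=0)  # the host hammers a single address
print(ftl.erase_counts)
```

Even though the host rewrites the same logical address 800 times, the 799 resulting erases are spread across all eight blocks, with per-block counts differing by at most one. Without the remapping, a single block would absorb all of them.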

For example, if the flash media in a 600 GB SSD can endure 10,000 erase cycles, the SSD can accept about 6 PB of writes in total (600 GB × 10,000).

Meanwhile, industry data on large numbers of enterprise hard disks shows that the total amount of data written to a disk over its entire life cycle is limited. Assuming that a single disk receives less than 200 TB of writes, only a small fraction of the 6 PB above, the 600 GB SSD can be used for more than 10 years.
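
The arithmetic can be written out as a short sketch. It uses decimal units and, like the example above, ignores write amplification. The source does not say over what period the 200 TB is written, so the per-year figure below is a deliberately pessimistic assumption, and the function names are hypothetical.

```python
def endurance_budget_tb(capacity_gb, erase_cycles):
    """Total data the flash can absorb, in TB (1 TB = 1000 GB)."""
    return capacity_gb * erase_cycles / 1000


def years_until_worn(capacity_gb, erase_cycles, writes_tb_per_year):
    """Years of service before the endurance budget is used up."""
    return endurance_budget_tb(capacity_gb, erase_cycles) / writes_tb_per_year


budget_tb = endurance_budget_tb(capacity_gb=600, erase_cycles=10_000)
print(budget_tb)  # 6000 TB, i.e. 6 PB

# Share of the budget consumed by 200 TB of writes.
print(200 / budget_tb)  # roughly 0.033, about 3%

# Assumption: even if 200 TB were written every single year.
print(years_until_worn(600, 10_000, writes_tb_per_year=200))  # about 30 years
```

Under either reading of the 200 TB figure, the result is comfortably beyond the 10 years quoted above.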

So even though the flash media itself can only be written thousands or tens of thousands of times, flash built into an SSD can meet the needs of enterprise applications.

What factors determine SSD lifespan?


SSD lifespan generally refers to the wear life of the NAND flash it uses; the other components in an SSD generally do not limit its service life. To prolong SSD wear life, most manufacturers use the following methods.

  • Over-provisioning: For example, a 100 GB SSD contains more than 100 GB of internal flash memory; an enterprise-class SSD may have 128 GB or more. The excess is called redundant capacity.
  • Better parts: For example, better NAND flash chips and better controller chips. The maximum erase count of SLC flash is higher than that of MLC, which in turn is higher than that of TLC.
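
For the over-provisioning example, the amount of redundancy is commonly expressed as a ratio of spare capacity to user-visible capacity. A minimal sketch, with a hypothetical function name and the 100 GB / 128 GB figures from the list above:

```python
def over_provisioning_ratio(physical_gb, user_gb):
    """Spare flash as a fraction of the capacity exposed to the user."""
    return (physical_gb - user_gb) / user_gb


ratio = over_provisioning_ratio(physical_gb=128, user_gb=100)
print(f"{ratio:.0%}")  # prints 28%
```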

SSD service life therefore depends on the amount of redundant capacity, the flash type, and the error correction capability of the device. The more redundant capacity, the longer the service life; the better the flash type, the longer the life; and the stronger the error correction, the longer the life.

SSD longevity does not depend on the flash type alone but on a combination of factors. With ample redundant capacity and a powerful LDPC error correction algorithm, an SSD can meet the needs of a wide range of enterprise applications.

What are the common causes of SSD failure?


The possible causes of SSD failure include flash media faults, other hardware faults, and software faults. Unlike HDDs, SSDs have no mechanical components, so they place fewer demands on the operating environment: any environment in which HDDs run normally is also suitable for SSDs. In addition, SSDs keep their stability and reliability in vibration-prone environments such as subways and ships, which helps services run steadily.

Can data be recovered after an SSD failure?


Although SSDs and HDDs work on different principles, in most scenarios the data on a faulty SSD can be partially or completely recovered by troubleshooting the fault. In this respect SSDs are similar to HDDs.

  1. Flash media failure: If some of the flash media inside the SSD fails and causes the SSD to fail, the situation is similar to physical damage to the platters inside an HDD. The faulty part can be isolated and the remaining data recovered.
  2. Other hardware failure: If the SSD fails because a hardware component other than the flash media has failed, data can be partially or completely restored by replacing the faulty component.
  3. Software failure: If the SSD software fails, upgrading the software can recover some or all of the data.
  4. Secure data wipe: This is not actually an SSD failure. After a full-disk secure wipe has been performed on an SSD, the data cannot be recovered, so an SSD does not have to be physically destroyed to secure its data, as an HDD must be.

That is all for today. For more in-depth knowledge of flash, please refer to the earlier detailed articles.

END.

Hi, I’m Leo Zhi, the Field Sales Engineer

With more than 10 years of experience in NAND flash storage, I’d love to share with you valuable knowledge from a Chinese supplier’s perspective.

Hope you like this article, and please share it or subscribe to our newsletter.
