MTBF: Hard drive failure prediction?

At Kroll Ontrack, we’re well aware that data loss can affect anyone. For many of us, it comes in the form of hard disk drive (HDD) failure: mechanical and electronic defects that render the information stored on the drive unreadable. There are dozens of possible causes for this type of malfunction, ranging from logical software errors to physical damage. And, of course, all storage devices have a finite lifespan.

Most of us could name some of the tell-tale signs that a hard drive is about to fail. For example, if your HDD shifts from a pleasant whirring noise to a grinding one, it’s a safe bet that it’s about to quit. Slowing data access and strange behaviour (corrupted data, missing files) are also reliable indicators of impending hard drive failure.

Unfortunately, these aren’t what you’d call scientific metrics for detecting an HDD malfunction. While it’s one thing to watch for oddities on your individual laptop or tower, it’s another to apply the same methodology to a redundant array of independent disks (RAID) environment in a remote data center.

So how can consumer and business users predict when their hard drives are about to fail? Well, their first step might be to check the manufacturers’ estimates of their storage device lifespans. These estimates are usually listed as a mean time between failures (MTBF) rating. This is a common benchmark for hard drives, but what does it really mean and how is it calculated?

What is mean time between failures? 

The MTBF rating is just what it sounds like: the average period of time between one inherent failure and the next in a single component’s lifespan. In other words, if a machine or part malfunctions and is afterwards repaired, its MTBF figure is the number of hours it can be expected to run as normal before it breaks down again.

With consumer hard drives, it’s not uncommon to see MTBFs of around 300,000 hours. That’s 12,500 days, or a little over 34 years. Meanwhile, enterprise-grade HDDs advertise MTBFs of up to 1.5 million hours, which is the best part of 175 years. Could you imagine if these MTBFs were real-world expectations of hard drive longevity and reliability? It would be an IT manager’s dream come true!
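
As a quick sanity check on those figures, here is the arithmetic in a few lines of Python (the MTBF values come straight from the paragraph above):

```python
HOURS_PER_DAY = 24
HOURS_PER_YEAR = 8760  # 24 * 365

consumer_mtbf = 300_000      # hours, typical consumer HDD rating
enterprise_mtbf = 1_500_000  # hours, high-end enterprise HDD rating

print(consumer_mtbf / HOURS_PER_DAY)     # 12500.0 days
print(consumer_mtbf / HOURS_PER_YEAR)    # about 34.2 years
print(enterprise_mtbf / HOURS_PER_YEAR)  # about 171.2 years
```

No drive on the market is expected to spin for anything like 34 years, let alone 171, which is the first hint that the rating means something other than expected lifespan.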

Unfortunately, there’s a discrepancy between the MTBF metric and real-world lifespans. The MTBF metric has a long and distinguished lineage in military and aerospace engineering. The figures are derived from error rates in a statistically significant number of drives running for weeks or months at a time.
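
To see how a seemingly enormous MTBF can emerge from a short test, here is a minimal sketch; the fleet size, test length and failure count below are illustrative assumptions, not vendor data:

```python
drives = 10_000     # drives in the test population (assumed)
test_hours = 1_000  # hours each drive ran during the test (assumed)
failures = 10       # failures observed over the test (assumed)

# MTBF is estimated as total accumulated device-hours per failure
device_hours = drives * test_hours
mtbf = device_hours / failures
print(mtbf)  # 1000000.0 hours, though no drive ran longer than 1,000 hours
```

The one-million-hour figure describes the failure rate of a large population over a short window, not how long any individual drive will last.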

Correspondingly, studies have demonstrated that MTBF ratings typically promise much lower failure rates than actually occur in the field. In 2007, researchers at Carnegie Mellon University investigated a sample of 100,000 HDDs with manufacturer-provided MTBF ratings of one million to 1.5 million hours. The one-million-hour figure translates to an annual failure rate (AFR) of 0.88 percent, but the study found that AFRs in the field “typically exceed one percent, with two to four percent common and up to 13 percent observed in some systems”.
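
The MTBF-to-AFR conversion is easy to reproduce: assuming a constant failure rate, the AFR is roughly the number of hours in a year divided by the MTBF:

```python
HOURS_PER_YEAR = 8760

def afr_percent(mtbf_hours):
    """Approximate annual failure rate (%) implied by an MTBF rating,
    assuming a constant failure rate."""
    return 100 * HOURS_PER_YEAR / mtbf_hours

print(round(afr_percent(1_000_000), 2))  # 0.88 -> the datasheet-implied AFR
print(round(afr_percent(1_500_000), 2))  # 0.58
```

So a drive rated at one million hours should, on paper, see fewer than nine failures per thousand drives per year, several times lower than the rates Carnegie Mellon observed in practice.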

Manufacturers aren’t turning a blind eye to this discrepancy. Recently, both Seagate and Western Digital have phased out using the metric for their HDDs.

So with MTBFs shown to be an unreliable indicator of hard drive health, how else can we predict the end of a storage device’s lifespan? In our next blog, we’ll discuss the pros and cons of using SMART tools to detect when an HDD is on the verge of quitting.

One Response to "MTBF: Hard drive failure prediction?"

  • William Hagen
    20th May 2015 - 8:51 pm

    MTBF is meaningful only when an item has a constant failure rate, i.e. the failures are exponentially distributed.

    Hard disk drives are primarily mechanical devices, with mechanical failure modes. Mechanical failure modes tend to produce lifetimes that are normally distributed.

    If we suppose an application that uses a large number of hard disk drives, and the failures are exponentially distributed, then the expected number of failures in any two intervals of the same size would be roughly the same. A hard drive would be just as likely to fail on the 100th day as on the 10,000th day.

    HDDs in actual use exhibit wearout. Very few failures will occur after an initial ‘infant mortality’ phase until some point in time where the failure rate rapidly increases.

    For typical mechanical HDDs the wearout point is at from 3 to 5 years of constant operation.
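
The commenter’s distinction between a constant (exponential) failure rate and wearout can be illustrated with a short simulation. The Weibull shape and scale below are arbitrary assumptions chosen only to make the wearout effect visible, not measured HDD parameters:

```python
import random

random.seed(1)

N = 100_000
HOURS_PER_YEAR = 8760
MTBF = 1_000_000  # hours -> constant hazard of roughly 0.88% per year

# Exponential (memoryless) lifetimes: a drive is about as likely to
# fail in its fifth year of service as in its first
exp_lives = [random.expovariate(1 / MTBF) for _ in range(N)]

# Weibull lifetimes with shape > 1 model wearout: the hazard rises
# with age (scale of 5 years and shape 3.0 are assumptions)
wb_lives = [random.weibullvariate(5 * HOURS_PER_YEAR, 3.0) for _ in range(N)]

def failures_in_year(lives, year):
    """Count lifetimes that end during the given year of service."""
    lo, hi = (year - 1) * HOURS_PER_YEAR, year * HOURS_PER_YEAR
    return sum(lo <= t < hi for t in lives)

for year in (1, 5):
    print(year, failures_in_year(exp_lives, year), failures_in_year(wb_lives, year))
```

Running this, the exponential population fails at roughly the same rate in year one and year five, while the Weibull (wearout) population shows few early failures and then a steep climb, which is the pattern the commenter describes for real HDDs.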
