HDDs fail sometimes (know which ones beforehand)

Have you noticed how electronic equipment always seems to fail at the worst time? Then again, is there ever a ‘good’ moment for a failure? This is why you, the data citizen, are often asked to solve the following business problem: reduce the failure costs related to managing a pool of electronic devices by extracting early, reliable failure signals from raw sensor data.

Let’s imagine you work for a cloud storage company and are trying to anticipate hard drive failures. You have a lot of data to play with. From there, the Machine Learning solution is easy to define: build early, reliable failure signals, then combine them within a classifier.

But wait a minute. There are so many possible signals that you are, in fact, dealing with infinity. Don’t scrutinize infinity for too long ...


... and take a moment to read how PredicSis ML (PredicSis ML Developer) can help.

The power to know beforehand

Mechanical hard drives are the raw material of your business, and your company has set up processes to manage this resource cost-effectively: you have more than 55,000 drives, with a daily failure rate fluctuating around 0.02% (quite a difficult Machine Learning problem, isn’t it?). Fortunately, the drives are monitored daily by SMART (Self-Monitoring, Analysis, and Reporting Technology).

Using PredicSis ML, you can improve this business process by automatically delivering, every day, the list of drives that are likely to fail on the next day. This way, your company can act proactively and seamlessly to reduce the impact on its service level agreement.

What PredicSis ML brings to you

PredicSis ML brings you the ‘Ta-da moment’ instead of the ‘why should I face infinity one more time?’ moment.

To get to this ‘Ta-da moment’, please follow these steps:

  1. Download the raw hard drive data: a collection of daily log files recording about 45 SMART sensor values per drive.

  2. Define the binary target variable answering the question ‘Did the drive fail on the next day?’ This variable is derived from the ‘failure’ column already available in the dataset.

  3. Choose a training period and prepare the data to generate positive and negative cases:

    1. A case is labeled ‘Y’ when a drive failed during the training period. For such a case, keep the log data from the 30 days prior to the failure.

    2. A case is labeled ‘N’ when a drive hasn’t failed during the training period. In that case, randomly pick a day within the training period and keep the log data from the 30 days prior to that day.

  4. Produce 2 flat files: one with a single row per drive and 3 columns (drive identifier, model name, target), and one with 30 rows per drive (one per day) and 45 columns (SMART values).

  5. Run PredicSis ML and Ta-da!
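Steps 2 to 4 can be sketched in plain Python. This is a minimal illustration, not PredicSis ML’s own preparation code: the `DailyLog` record, its field names and the `build_cases` helper are hypothetical, and a real pipeline would read the daily log files instead of an in-memory list. The reference day is the failure day for ‘Y’ cases and a random day for ‘N’ cases; the 30-day window stops the day before it, so the target always refers to the day after the last kept log.

```python
import random
from dataclasses import dataclass, field
from datetime import date, timedelta

WINDOW_DAYS = 30  # days of history kept before the reference day


@dataclass
class DailyLog:
    """One row of a daily log file (hypothetical field names)."""
    day: date
    drive_id: str
    model: str
    failure: int  # 1 on the day the drive failed, 0 otherwise
    smart: dict[str, float] = field(default_factory=dict)  # SMART attribute -> value


def build_cases(logs, train_start, train_end, seed=0):
    """Build the two flat tables described in step 4.

    Returns (drives_table, logs_table):
      drives_table: one (drive_id, model, target) row per drive
      logs_table: up to WINDOW_DAYS (drive_id, day, smart) rows per drive
    """
    rng = random.Random(seed)
    by_drive: dict[str, list[DailyLog]] = {}
    for log in logs:
        by_drive.setdefault(log.drive_id, []).append(log)

    drives_table, logs_table = [], []
    for drive_id in sorted(by_drive):
        history = sorted(by_drive[drive_id], key=lambda log: log.day)
        in_period = [log for log in history if train_start <= log.day <= train_end]
        if not in_period:
            continue  # drive not observed during the training period
        failure_days = [log.day for log in in_period if log.failure == 1]
        if failure_days:
            # Positive case: the reference day is the failure day.
            target, ref_day = "Y", failure_days[0]
        else:
            # Negative case: the reference day is a random day in the period.
            target, ref_day = "N", rng.choice(in_period).day
        start = ref_day - timedelta(days=WINDOW_DAYS)
        window = [log for log in history if start <= log.day < ref_day]
        if not window:
            continue  # no history before the reference day, so no usable case
        drives_table.append((drive_id, history[-1].model, target))
        logs_table.extend((drive_id, log.day, log.smart) for log in window)
    return drives_table, logs_table
```

Excluding the reference day from the window matters for ‘Y’ cases: keeping the failure day’s own log would leak the answer into the features.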

PredicSis ML investigates infinity for you by automatically combining standard operators (min, max, mean, mode, count, filter, …) with the log data, while controlling overfitting by design. More on this later. For now, let’s have a look at the performance.
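To give an idea of what ‘combining standard operators with log data’ means, here is a hand-written sketch that applies a few of them (min, max, mean, mode, and a filtered count) to each SMART attribute over a drive’s 30-day window. It assumes rows shaped like `(drive_id, day, smart)`, with `smart` a dict of attribute values (a hypothetical layout). It is not PredicSis ML’s algorithm, which also searches combinations of operators and controls overfitting; neither is reproduced here.

```python
from collections import defaultdict
from statistics import mean, mode


def aggregate_features(logs_table):
    """Turn per-day SMART rows into one feature row per drive.

    logs_table: iterable of (drive_id, day, smart) where smart maps
    a SMART attribute name to its value on that day.
    """
    # drive_id -> attribute -> list of observed values over the window
    series = defaultdict(lambda: defaultdict(list))
    for drive_id, _day, smart in logs_table:
        for attr, value in smart.items():
            if value is not None:  # skip missing readings
                series[drive_id][attr].append(value)

    features = {}
    for drive_id, attrs in series.items():
        row = {}
        for attr, values in attrs.items():
            row[f"min({attr})"] = min(values)
            row[f"max({attr})"] = max(values)
            row[f"mean({attr})"] = mean(values)
            row[f"mode({attr})"] = mode(values)
            # filter + count: number of days with a non-zero reading
            row[f"count({attr} > 0)"] = sum(1 for v in values if v > 0)
        features[drive_id] = row
    return features
```

Even this small set yields 5 features per attribute, so about 225 for 45 SMART attributes, before any combination of operators is considered.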

This (statistically) performs!

We used the 2015 daily logs from January to the end of May as the training period. Historically, 4 drives fail each day on average. We then ran our classifier to score each day of June. Focusing each day on the 4 drives with the highest failure probability, we would have achieved a precision of 46% over the 30 days of June. That’s not too bad, especially when the average daily failure rate is below 0.01%. We are talking about a lift of more than x4,600!
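The evaluation above can be sketched as follows. This is a minimal illustration, assuming the daily scores come as a dictionary mapping each drive identifier to its predicted failure probability (a hypothetical format; the post does not describe PredicSis ML’s output). Precision is the share of flagged drives that actually failed that day, and lift is precision divided by the base failure rate, i.e. how many times better the alerts are than picking drives at random.

```python
def top_k(scores: dict[str, float], k: int) -> list[str]:
    """Return the k drive identifiers with the highest failure probability."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


def precision_over_period(
    daily_scores: dict[str, dict[str, float]],
    daily_failures: dict[str, set[str]],
    k: int = 4,
) -> float:
    """Share of the daily top-k alerts that matched an actual failure that day.

    daily_scores: day -> {drive_id: predicted failure probability}
    daily_failures: day -> drive_ids that actually failed on that day
    """
    hits = alerts = 0
    for day, scores in daily_scores.items():
        flagged = top_k(scores, k)
        failed = daily_failures.get(day, set())
        alerts += len(flagged)
        hits += sum(1 for drive_id in flagged if drive_id in failed)
    return hits / alerts if alerts else 0.0


def lift(precision: float, base_rate: float) -> float:
    """How many times better than random picking the alerts are."""
    return precision / base_rate


if __name__ == "__main__":
    base_rate = 4 / 55_000  # ~4 failures per day out of ~55,000 drives
    print(f"base rate: {base_rate:.4%}")
    print(f"lift at 46% precision: {lift(0.46, base_rate):,.0f}")
```

With the figures from the post, `lift(0.46, 4 / 55_000)` is about 6,325, and even at a base rate of exactly 0.01% the lift is about 4,600.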

This is not a black box

PredicSis ML is not a black box. You, as a data citizen, can understand which kinds of signals drive failure.

For example, the maximum value taken by SMART attribute 197 over the past 30 days is a strong pre-failure signal. SMART 197 tracks the current pending sector count. As long as this maximum stays at 0, the drive is very unlikely to fail on the next day. But once it goes beyond 8, the drive fails on the next day with an 88% probability.
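One way to check this kind of signal on your own labeled cases is to bin drives by the maximum of SMART 197 over their window and compute the observed failure rate per bin. The sketch below rests on assumptions: the bins follow the thresholds quoted above (0, and beyond 8), the middle bin ‘1-8’ is an arbitrary grouping that the post does not characterize, and the input is a plain list of `(max_smart_197, target)` pairs.

```python
from collections import Counter


def bin_max_197(value: float) -> str:
    """Bucket the 30-day maximum of SMART 197 (current pending sector count)."""
    if value == 0:
        return "0"
    if value <= 8:
        return "1-8"
    return ">8"


def failure_rate_by_bin(cases) -> dict[str, float]:
    """Observed share of 'Y' targets in each bin.

    cases: iterable of (max_smart_197, target) with target 'Y' or 'N'.
    """
    totals, failures = Counter(), Counter()
    for max_197, target in cases:
        bucket = bin_max_197(max_197)
        totals[bucket] += 1
        if target == "Y":
            failures[bucket] += 1
    return {bucket: failures[bucket] / totals[bucket] for bucket in totals}
```

Small bins give noisy rates, so the per-bin counts in `totals` are worth inspecting alongside the rates.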


Business value from automated feature surfacing

PredicSis ML offers automated feature discovery: it generates and combines features from log data while preventing overfitting. Again, for any data citizen, this drastically reduces the time needed to build a first set of highly predictive features from raw data, so more time can be spent adding business knowledge and integrating the model into the business process.