How to save precious time in data preparation

Let’s start with a quick question. 

Don’t you ever get tired of the pseudo-algorithm that runs your life as a data citizen? You know, the one that goes:

WHILE I_have_time = TRUE AND great_idea_for_new_feature_set IS NOT EMPTY
    Write and run queries to extract these new features
    Evaluate the statistical relevance of these new features
    Beware the two-headed dragon of over/underfitting
    Fight the GIGO rule
    Have great ideas about some new features

As a data citizen, you dedicate up to 80% of your working time to the data preparation step, and feature discovery takes a huge part of it. Chatting with the data owners, building predictive models, delivering proven ROI to your stakeholders, eating, sleeping, and other basic functions fill the remaining 20%. And that’s just plain sad.

So! I know that, like me, you sometimes wish you had a magic wand transforming your raw data into informative and reliable features. 

Well, guess what. This wand is here already! 

Yeah, I know, this seems a little too good to be true, but before you close this tab with a big “Oumpf!”, give me some time to explain. Let’s see the magic trick in action, on a concrete use case of aircraft maintenance (remember that use case? Your boss told you to work on it for him, and you even wrote an article about it).

The magic wand way and the spaghetti way

“Wait a moment: the ‘spaghetti way’?! What are you talking about?”, you say. You know what I’m talking about; look at this:


This one is from Microsoft, with Cortana Analytics. You know we could have shown the same from SAS Enterprise Miner, RapidMiner, and many others. 

I bet you get it now. It truly looks like a plate of spaghetti. 

But let’s get back to business, erm, to our use case: the raw data you are working on, the aircraft engine readings from sensors on hundreds of pieces of equipment. In this case, the spaghetti flow chart exposes (only!) 45 nested substeps of data preparation and feature engineering. And don’t forget that something is missing from this spaghetti plate: hidden is the delicate tomato sauce of exploratory analysis.

But let’s talk about those spaghetti: why these features? Why specifically apply moving averages and standard deviations within a time window of 5 cycles? Why averages? Why standard deviations? Why 5? Are these random choices? Certainly not! 
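For the record, those hand-picked moving averages and standard deviations over a 5-cycle window are simple to reproduce. Here is a minimal pandas sketch, on made-up readings for a single engine (the numbers are purely illustrative):

```python
import pandas as pd

# Hypothetical sensor readings for one engine, one value per cycle.
readings = pd.Series([40.1, 40.3, 40.2, 40.8, 41.0, 41.5, 42.0, 42.6])

# The kind of hand-crafted feature the flow chart encodes:
# moving average and standard deviation over a 5-cycle window.
moving_avg = readings.rolling(window=5).mean()
moving_std = readings.rolling(window=5).std()

features = pd.DataFrame({"avg_5": moving_avg, "std_5": moving_std})
print(features.tail(3))
```

The first 4 cycles yield no value, since a full 5-cycle window is not available yet. Multiply this by dozens of sensors, several statistics, and several candidate window lengths, and you get the 45 substeps of the spaghetti plate.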

The key word here is “CHOICE”. Whatever procedure leads to these precise features and time window parameters, it is certainly what occupies the major part of the greedy 80% of your time.

Now, it’s time to try out the electronic magic wand: let’s see how it takes charge of automatically discovering a first batch of reliable features.

Remember that you have sequences of sensor data from aircraft engines, each engine being labeled with one of three classes of RUL (Remaining Useful Life, in cycles): [0; 14], [15; 30] and [31; +∞[. You cannot avoid picking time windows, but you can provide raw sequences of different lengths (1, 5, 10 and 20 cycles, for instance) and let the magic wand operate. From there, it automatically computes averages, standard deviations, maxima, minima, medians, sums, counts, etc., evaluates the predictive performance of the resulting features with respect to the 3 classes, and keeps only the informative ones.
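To make the idea concrete, here is a minimal sketch of that generate-then-filter loop, on entirely made-up data (the toy degradation model, the labels, and the correlation-based filter are all assumptions of mine; a real tool uses a proper statistical criterion, but the principle is the same):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Toy stand-in for the readings: 30 engines, 50 cycles, 3 sensors.
rows, drifts = [], []
for engine in range(30):
    drift = rng.uniform(0.0, 1.0)            # per-engine degradation rate
    drifts.append(drift)
    for cycle in range(50):
        rows.append({
            "engine": engine,
            "s1": rng.normal(40 + 0.1 * drift * cycle, 0.5),
            "s2": rng.normal(100, 2.0),      # pure noise, no signal
            "s3": rng.normal(500 - 0.2 * drift * cycle, 1.0),
        })
data = pd.DataFrame(rows)

# Fast-degrading engines fail soon: map drift onto the 3 RUL classes.
labels = pd.Series([0 if d > 0.66 else 1 if d > 0.33 else 2
                    for d in drifts])

# For each window length, aggregate the last w cycles of every engine
# with a battery of simple statistics.
blocks = []
for w in (1, 5, 10, 20):
    tail = data.groupby("engine").tail(w)
    aggs = tail.groupby("engine")[["s1", "s2", "s3"]].agg(
        ["mean", "std", "min", "max", "median", "sum"])
    aggs.columns = [f"{s}_{stat}_{w}" for s, stat in aggs.columns]
    blocks.append(aggs)
features = pd.concat(blocks, axis=1)

# Crude relevance filter: keep features whose absolute correlation
# with the class label clears a threshold.
relevance = features.corrwith(labels).abs()
informative = features.loc[:, relevance > 0.2]
print(f"{features.shape[1]} candidates, {informative.shape[1]} kept")
```

Even this naive version already separates the drifting sensors from the pure-noise one; the point is that the candidate generation and the relevance test are entirely mechanical, so the machine can do them for you.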

While you were reading the last paragraph, we did the job for you and let the system work for… 14 seconds! Damn, it is fast! And that’s the point: let the machine automatically and quickly extract reliable features from raw data, bringing 80% of the performance. Then you can spend more time on what you are good at: using your intuition to grab the remaining 20%. 


Please, show me the details!

And I’ll respond: “With pleasure!” First, let me show you the confusion matrix computed on the test dataset:


The performance is no less than 91% accuracy, while Cortana reports 88% for the multiclass logistic regression model and 92% for the neural network model. And please note that their user “has chosen the optimal algorithm here by selecting a variety of options and evaluating the performance”, while we just clicked on the “learning button”.
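If confusion matrices are not your daily bread: accuracy is simply the diagonal of the matrix (correct predictions) divided by its total. A tiny sketch, on hypothetical predictions for the three RUL classes (the counts are illustrative, not the actual test-set results):

```python
import numpy as np

# Hypothetical true and predicted RUL classes (0, 1, 2).
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2])
y_pred = np.array([0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 2])

n_classes = 3
confusion = np.zeros((n_classes, n_classes), dtype=int)
for t, p in zip(y_true, y_pred):
    confusion[t, p] += 1       # rows: true class, columns: predicted

# Accuracy = correctly classified / total.
accuracy = np.trace(confusion) / confusion.sum()
print(confusion)
print(f"accuracy = {accuracy:.2%}")
```

The off-diagonal cells tell you which classes get confused with which, which matters here: predicting [31; +∞[ for an engine that fails within 14 cycles is a much worse mistake than the reverse.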

And, one more thing: this is not like Schrödinger’s cat’s box.


It is transparent (and no cats are harmed).

The surfaced features are understandable by people like you, my dear data citizen. Let’s have a look at the first one among the 61 informative features, because I know how much you like pretty histograms!

The most informative feature is the average of the values from sensor 11 over 10 cycles. And now that you look at this histogram (beautiful, isn’t it?), I bet you can’t help wanting to tell a story from it.

The story goes like this.

From the right part of the discretization, I can see that the engine will almost certainly fail within 30 cycles when the average value of sensor 11 over 10 cycles is higher than 48.0245. I can tell my boss that he should watch out for rises in the 11th sensor readings. His most probable answer will be: “I know that already, but how do YOU know? You don’t even know what the 11th sensor is!”. “You are right, but please keep track of the readings for 10 cycles, and watch out for averages over 48.0245”. 
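That story is already an actionable decision rule, small enough to hand to the maintenance team. A sketch, where the 48.0245 cut and the 10-cycle window come from the discretization above, while the helper name and the reading sequences are hypothetical:

```python
from statistics import mean

def failure_alert(sensor11_history, threshold=48.0245, window=10):
    """Flag an engine when the average of the last `window` readings
    of sensor 11 exceeds `threshold` (the cut found on the histogram)."""
    if len(sensor11_history) < window:
        return False                     # not enough history yet
    return mean(sensor11_history[-window:]) > threshold

# Hypothetical reading sequences, for illustration only.
healthy = [47.2, 47.4, 47.3, 47.5, 47.6, 47.4, 47.5, 47.3, 47.6, 47.5]
degraded = [47.9, 48.0, 48.1, 48.0, 48.2, 48.1, 48.0, 48.2, 48.1, 48.3]

print(failure_alert(healthy))    # average 47.43, below the cut
print(failure_alert(degraded))   # average 48.09, above the cut
```

No model deployment, no scoring pipeline: a threshold on one surfaced feature is something the boss can monitor from day one.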

And that’s what we call a bingo. For 14 seconds of computer work. Sounds like a good bargain to me.


Maybe you are puzzled. Maybe you are wondering: “If it is about pressing a button, could a monkey do my job?”. 

Certainly not, and I am not flattering you. The great news is that the machine will now help you find the rough diamonds in your data and leave you with the best part of the job: cutting these diamonds and crafting fine jewelry.

P.S.: if you are burning to ask about the median of sensor 9 over 20 cycles, then please do ask below!

Intrigued? If you want to get a sense of it, we made it easy for you through our demo page.