In just a few years, Data Science, Machine Learning (ML) and now Deep Learning have become very important business tools for data-driven industries. TensorFlow is now one of the best-known Deep Learning frameworks for building Deep Neural Networks (DNN). DNNs appear capable of handling any kind of Data Science and Artificial Intelligence problem that has one or more “deep” patterns: image recognition using Convolutional Neural Networks (CNN), and time series or Natural Language Processing using Recurrent and Recursive Neural Networks (RNN).
Even though DNNs have huge potential for application to all kinds of problems, they also have significant issues of their own (as do most other ML algorithms): models fail to handle uncorrelated features; by design, models overfit and learn from irrelevant, noisy or rare data; and so on. In this post, we propose pre-processing the data to automatically reject uncorrelated features and let the DNN work only on statistically relevant and optimised data.
Using an Automatic Machine Learning pre-processing algorithm designed by PredicSis, we can improve and accelerate TensorFlow modelling.
In the following example we compare three metrics (accuracy, loss and AUC) over three classical data sets: US Census, Breast Cancer and Glass. We run a first DNN model over raw data and a second over relevant and optimised data.
We quickly observe that the DNN models are more accurate and less sensitive to noise when using pre-processed data. Below we describe the data used, the protocol and pre-processing algorithm, and finally the results.
Let’s start by introducing our data sets. The US Adult Income Census data set consists of about 48,000 rows and 14 fields:
- 6 are continuous: age, fnlwgt, education-number, capital-gain, capital-loss, hours-per-week
- 7 are categorical: education, marital-status, occupation, relationship, race, sex, native-country
- The target is to predict the income bracket
The Breast Cancer data set is made up of a little under 600 individual patient test results. Each test result has 29 columns and our goal is to predict whether the patient has breast cancer. All variables are numeric. This data set was chosen because it consists almost entirely of continuous variables.
The goal of the Glass data set is to predict the type of glass based on 9 variables, such as the concentration of Iron, Calcium, etc. and the Refractive Index. There are 7 different types of glass.
PredicSis.ai is an Automatic (Classifier) Machine Learning solution designed to avoid overfitting and provide truly robust models. The solution is designed using an efficient process that manages relational data, outlined in the following three steps:
- Features are automatically generated from aggregated relational data (scientific paper)
- A non-parametric (MODL) algorithm selects and optimises the features (scientific paper)
- An ensemble learning algorithm like Random Forest or others is applied to the features (scientific paper)
In this post, we only use the non-parametric algorithm to select and optimise features. Without going into the mathematics behind this algorithm, the main idea is to find, for each feature, the discretisation or partition that gives the best compromise between the prior (too simple a partition underfits) and the likelihood (too fine a partition overfits).
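To make the prior versus likelihood trade-off more concrete, here is a minimal sketch in plain Python. It is not the PredicSis implementation: the cost function is an assumption of this sketch, loosely modelled on Bayesian discretisation criteria, and the helper names (`partition_cost`, `best_single_cut`) are hypothetical. It only searches for a single cut point, whereas a real discretisation would consider many intervals.

```python
import math
from collections import Counter


def log_binomial(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def partition_cost(intervals, n_classes):
    """Two-part cost of a partition: prior (complexity) plus likelihood (fit).

    `intervals` is a list of lists of class labels, one list per interval.
    Lower is better. Assumes at least one labelled row overall.
    """
    n = sum(len(labels) for labels in intervals)
    k = len(intervals)
    # Prior: cost of choosing the number of intervals and their bounds.
    cost = math.log(n) + log_binomial(n + k - 1, k - 1)
    for labels in intervals:
        n_i = len(labels)
        # Prior: cost of choosing the class distribution in this interval.
        cost += log_binomial(n_i + n_classes - 1, n_classes - 1)
        # Likelihood: log multinomial coefficient of the observed labels.
        cost += math.lgamma(n_i + 1)
        cost -= sum(math.lgamma(c + 1) for c in Counter(labels).values())
    return cost


def best_single_cut(values, labels):
    """Return (cut, cost) for the best one-cut partition of a numeric feature.

    `cut` is None when no cut beats the single-interval model, which this
    sketch treats as "the feature looks uninformative for the target".
    """
    pairs = sorted(zip(values, labels))
    sorted_labels = [y for _, y in pairs]
    n_classes = len(set(labels))
    best_cut = None
    best_cost = partition_cost([sorted_labels], n_classes)
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot cut between two equal values
        cost = partition_cost([sorted_labels[:i], sorted_labels[i:]], n_classes)
        if cost < best_cost:
            best_cut = (pairs[i - 1][0] + pairs[i][0]) / 2
            best_cost = cost
    return best_cut, best_cost
```

With a feature whose low values all belong to one class and high values to the other, `best_single_cut` returns the midpoint between the two groups. With labels that alternate regardless of the value, every cut costs more than the single-interval model, so the function returns `None` and the feature would be a candidate for rejection.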
From the DNN part of the tutorial (see link here) we take the tricks and feature combinations needed to train the DNN over “unprepared” data. The DNNs are trained for 200 steps (just enough to converge on the data). We do not use all of the tutorial's feature-combination tricks; instead we benchmark the DNN models on raw data against the same models on automatically optimised data. We compare the results using three metrics: accuracy, AUC and loss.
This can be summarised by the following diagram:
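For readers who want to reproduce the comparison, the three metrics can be sketched in plain Python as follows. This is an illustrative sketch for the binary case only (the Glass data set has 7 classes and would need multi-class versions); the function names are hypothetical, and log loss is assumed here to be the binary cross-entropy.

```python
import math


def accuracy(y_true, y_prob, threshold=0.5):
    """Share of rows whose thresholded prediction matches the label."""
    hits = sum((p >= threshold) == bool(y) for y, p in zip(y_true, y_prob))
    return hits / len(y_true)


def log_loss(y_true, y_prob, eps=1e-12):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)
        total -= math.log(p) if y == 1 else math.log(1 - p)
    return total / len(y_true)


def auc(y_true, y_prob):
    """Probability that a random positive outscores a random negative.

    Ties count for half. Quadratic in the number of rows, so only
    suitable for small evaluation sets.
    """
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def evaluate(y_true, y_prob):
    """Bundle the three metrics used in the benchmark into one dict."""
    return {
        "accuracy": accuracy(y_true, y_prob),
        "auc": auc(y_true, y_prob),
        "loss": log_loss(y_true, y_prob),
    }
```

Calling `evaluate` once on the predictions of the model trained on raw data and once on those of the model trained on pre-processed data, with the same evaluation labels, gives two dictionaries that can be compared metric by metric.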
We obtain the following results:
In all three experiments, the models obtained are always more accurate with PredicSis.ai. Looking at the metrics every 20 steps, we can quickly see that the models are more accurate and focussed using PredicSis.ai. They are also less disturbed by raw data noise. (The following diagrams feature only evaluation data.)
Loss gives a similar step-by-step view using training and evaluation data:
It should be noted that these results have been achieved without any feature engineering or tricks whatsoever. By using PredicSis MODL for pre-processing optimisation, it is possible to reduce feature noise and hugely accelerate the learning. It also increases the quality of predictions across the data sets. We obtain all of these benefits with no constraints and zero hyper-parameters, and the optimised results come from fully automatic machine learning.
We have shown that a Standard Neural Network gives good results over classical structured data. In a follow-up article we will try to improve image recognition or maybe natural language processing using this same technique.
If you want to learn more about this experiment please leave a comment or contact us. See you soon for another data experiment!
Intrigued? To get a sense of what we do at PredicSis.ai, please visit our Free Trial page