Google Prediction vs. PredicSis API

Did you watch this video? If not, go and watch it!

… Done? Now, here are some details about the protocol we adopted and the results. We selected two data sets from well-known sources of machine learning problems: sigkdd.org/kdd-cup-2008-breast-cancer and kaggle.com/c/GiveMeSomeCredit/data.

Kaggle: Give Me Some Credit challenge

Understanding the data

According to the Give Me Some Credit Kaggle challenge website, the competition asks participants to improve on the state of the art in credit scoring by predicting the probability that somebody will experience financial distress in the next two years. The training set consists of 150,000 cases, 6.7% of which experienced financial distress. Each case is described by 10 features.

Preparing the data

Kaggle provides competitors with a single data file: cs-training.csv. To comply with Google Prediction API requirements, we recode the target values from 1 / 0 to "True" / "False", drop the first column (a primary key), and delete the header line. We randomly select 90% of the rows to build the training set, ending up with a training set file of 134,921 rows and 11 columns. We build a test set file by randomly picking 5,000 rows from the remaining ones, dropping the target and primary-key columns, and deleting the header line. We use "," as the field separator.
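For concreteness, here is a minimal sketch of that preparation in Python with pandas. The use of pandas and the target column name SeriousDlqin2yrs are assumptions (the original tooling is not specified); the steps mirror the ones described above.

```python
# Sketch of the Give Me Some Credit preparation, assuming pandas
# and the Kaggle column name "SeriousDlqin2yrs" for the target.
import pandas as pd

df = pd.read_csv("cs-training.csv")

# Drop the primary-key column (the first one) and recode the target.
df = df.drop(columns=df.columns[0])
df["SeriousDlqin2yrs"] = df["SeriousDlqin2yrs"].map({1: "True", 0: "False"})

# Randomly select 90% of the rows for the training set.
train = df.sample(frac=0.9, random_state=42)
remainder = df.drop(train.index)

# Pick 5,000 of the remaining rows and drop the target column.
test = remainder.sample(n=5000, random_state=42).drop(columns=["SeriousDlqin2yrs"])

# Write both files without a header line, comma-separated.
train.to_csv("train.csv", header=False, index=False)
test.to_csv("test.csv", header=False, index=False)
```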

Building the model

We build two predictive models, the first with PredicSis API and the second with Google Prediction API, and monitor the wall-clock time in both cases. It takes Google Prediction API 76.09 seconds versus 17.11 seconds for PredicSis API: PredicSis API is 4.4 times faster.
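These timings are wall-clock measurements taken around the full training call. A minimal sketch of the pattern, where train_model() is a hypothetical placeholder for either API's training call, not an actual endpoint of either service:

```python
# Wall-clock timing pattern used for the benchmark.
import time

def train_model():
    """Hypothetical placeholder: submit the training file to the API
    and block until the model is ready."""
    pass

start = time.perf_counter()
train_model()
elapsed = time.perf_counter() - start
print(f"Model built in {elapsed:.2f} seconds")
```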

Applying the model

We apply both models to the test data in order to compare the known labels with the estimated ones, and we monitor the wall-clock time needed to deliver the estimated labels. It takes Google Prediction API 369.04 seconds versus 4.51 seconds for PredicSis API: 81 times faster! We then compare the two models through the Area Under the ROC Curve (AUC). Never heard of ROC? Wikipedia has the answer. The model from Google Prediction API has a 0.743 AUC, while the model from PredicSis API has a 0.858 AUC: a 15.5% relative improvement (11.5 points of AUC). For perspective, it took more than 500 teams and 6,000 submissions from Kagglers to achieve a 1.35% increase in performance on this challenge.
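For reference, the AUC of each model can be computed with scikit-learn once the known test labels and the returned probabilities are at hand; a minimal sketch, with toy values standing in for the real arrays:

```python
# Comparing models through the Area Under the ROC Curve (AUC).
from sklearn.metrics import roc_auc_score

# Toy values: replace with the 5,000 known test labels and the
# probabilities returned by each API.
y_true = [0, 0, 1, 0, 1, 1, 0, 0]
y_score = [0.1, 0.3, 0.8, 0.2, 0.6, 0.9, 0.4, 0.1]

print(f"AUC = {roc_auc_score(y_true, y_score):.3f}")
```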

KDD Cup 2008: breast cancer

Understanding the data

The challenge focuses on the problem of early detection of breast cancer from X-ray images of the breast. Each image is represented by several candidates. The training set consists of a total of 102,294 candidates, but only an extremely small fraction of them is actually malignant. For each candidate, the data provide the image ID, the patient ID, several features, and a class label indicating whether or not it is malignant. In all, 117 features computed by several standard image-processing algorithms are provided.

Preparing the data

KDD Cup 2008 provides the user with two data files: info.txt and features.txt. We first perform an inner join to build a single file with 128 columns and 102,294 rows. The target values are recoded from 1 / 0 to "True" / "False", and we drop 4 columns that we consider identity columns. We randomly select 90% of the rows to build the training set; the output file has no header line, uses "," as the field separator, and ends up with 92,029 rows and 124 columns. We build a test set file by randomly picking 5,000 rows from the remaining ones, dropping the target and identity columns, deleting the header line, and using "," as the field separator.
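A sketch of that join and preparation in Python with pandas; the separator, the join key image_id, the label column name, and the identity-column names are placeholders to adapt to the actual KDD Cup 2008 schema:

```python
# Inner join of the two KDD Cup 2008 files, then the same
# preparation as for the first benchmark.
import pandas as pd

# Assumed tab-separated files with an "image_id" join key and a
# "label" target column; adjust to the real schema.
info = pd.read_csv("info.txt", sep="\t")
features = pd.read_csv("features.txt", sep="\t")

# Single table: 102,294 rows, 128 columns.
df = info.merge(features, on="image_id", how="inner")

# Recode the target from 1 / 0 to "True" / "False".
df["label"] = df["label"].map({1: "True", 0: "False"})

# Drop the 4 identity columns (placeholder names).
df = df.drop(columns=["image_id", "patient_id", "id_3", "id_4"])

# 90% random selection for training; no header, "," separator.
train = df.sample(frac=0.9, random_state=42)
train.to_csv("kdd_train.csv", header=False, index=False)
```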

Building the model

We build a classifier through both APIs and monitor the wall-clock time. It takes Google Prediction API 125.44 seconds to deliver the model, compared with 156.23 seconds for PredicSis API.

Applying the model

We apply the models to the test data in order to compare the known labels with the estimated ones, monitoring the wall-clock time once more. It takes Google Prediction API 421.58 seconds to label the instances, versus 19.15 seconds for PredicSis API: yes, 22 times faster. From a statistical point of view, two things are even more interesting. First, the model built with Google Prediction API is a black box: no insight is made available about the features driving the estimations. Second, the statistical performance of the resulting model is very poor. Indeed, the model built with Google Prediction API always labels a new instance as benign: the malignant class is not populated enough to be detected by Google's algorithm. PredicSis API does not suffer from these limitations: it provides the user with relevant information about the importance of every feature, and it is able to detect reliable patterns within highly unbalanced data sets.
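To see why such a degenerate model can look deceptively good, here is a small synthetic illustration (not the KDD Cup data): a classifier that always answers "benign" reaches near-perfect accuracy on a highly unbalanced set, yet its AUC shows it has no discriminative power at all.

```python
# Synthetic illustration: on highly unbalanced data, always
# predicting the majority class yields high accuracy but an AUC
# of 0.5, i.e. no better than chance.
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

rng = np.random.default_rng(0)
y_true = (rng.random(100_000) < 0.006).astype(int)  # ~0.6% malignant

y_pred = np.zeros_like(y_true)   # always "benign"
y_score = np.zeros(len(y_true))  # constant score for every instance

print(f"accuracy = {accuracy_score(y_true, y_pred):.3f}")  # ~0.994
print(f"AUC      = {roc_auc_score(y_true, y_score):.3f}")  # 0.500
```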

Playground

Don't take our word for it: try running your own benchmark!