The ICR (InVitro Cell Research) Identifying Age-Related Conditions Kaggle competition is over.
My submission finished 4745th out of 6713 submissions, with a log loss (the metric used in the competition) of 0.88. The winning submission had a log loss of 0.30 (lower is better).
My fundamental error was optimizing for straight log loss rather than balanced log loss. I did not notice that the competition specified balanced log loss, which weights the two class outcomes (develops an age-related condition or not) equally, so each class contributes the same to the score no matter how many observations it has. The training data was very unbalanced, so this mattered. My bad.
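For reference, the balanced log loss averages the per-class mean log losses, so the rare class counts as much as the common one. A minimal R sketch of the metric (my reconstruction, not competition code):

```r
## Balanced log loss: average of the mean log loss within each class,
## so both classes contribute equally no matter how unbalanced the data is.
balanced_log_loss <- function(y_true, p_pred, eps = 1e-15) {
  p   <- pmin(pmax(p_pred, eps), 1 - eps)  # clip to avoid log(0)
  ll0 <- -mean(log(1 - p[y_true == 0]))    # mean log loss over class-0 observations
  ll1 <- -mean(log(p[y_true == 1]))        # mean log loss over class-1 observations
  (ll0 + ll1) / 2
}
```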
But I do not think I would have done much better. The provided data set was very small - only 617 observations. After setting aside a test set and a calibration set (used to calibrate predicted probabilities - see below), I had only 346 observations left for training models. I could have used a somewhat larger training set at the cost of a smaller test set, but I didn't take the time to do so.
With such a small data set, my models were probably overfitted. I tried to address that by calibrating the probabilities coming out of the model predictions and by training on synthetic data. I did not cross-validate very much, which might have helped - but only so much, given how little data was provided.
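The post doesn't show the calibration step, so here is a sketch of one common approach, Platt scaling: fit a logistic regression of the true outcome on a model's raw predicted probability using the held-out calibration set, then apply it to new predictions. The data frames calib_df and test_df, the Class outcome, and the raw_prob column are hypothetical names, not taken from the actual code.

```r
## Platt scaling: regress the (0/1) outcome on the uncalibrated probability,
## fit on the calibration set only.
platt_fit <- glm(Class ~ raw_prob, data = calib_df, family = binomial)

## Apply the fitted mapping to the raw test-set predictions.
test_df$cal_prob <- predict(platt_fit, newdata = test_df, type = "response")
```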
I did address the small data set by training models on synthetic data derived from the provided data, and the models trained on the additional synthetic data did seem to do much better. However, the library I used to generate the synthetic data (synthpop) is not available on Kaggle. My attempt to install it there failed because Kaggle did not have a supported version of a C compiler, so I was unable to use my best model (or what I thought was my best model) for my competition submission.
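For completeness, generating synthetic data with synthpop is essentially one call to syn(); the sketch below assumes a training data frame called train_df (a hypothetical name) and uses synthpop's default CART-based synthesis.

```r
library(synthpop)

## Synthesize a data set with the same structure as train_df. By default,
## syn() models each variable with CART, conditioned on previously
## synthesized variables.
syn_out      <- syn(train_df, seed = 2023)
synthetic_df <- syn_out$syn   # the synthetic data frame

## Optional sanity check: compare real vs. synthetic marginal distributions.
compare(syn_out, train_df)
```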
Basically, I pre-processed the data, trained several models on the given data and a few on the synthetic data, kept track of the results, and submitted the one with the lowest log loss on my test set to the competition. Code sketches of the main steps follow the list below.
- Pre-processing
- Identify highly correlated features and remove the redundant ones.
- Replace NAs (missing values) with the mean of that feature's remaining values (mean imputation).
- For the random forest model, train on a subset of only the most important features.
- Models trained on provided data
- Random Forest (RF)
- Support Vector Machine (SVM)
- Boosted Tree (BT)
- K Nearest Neighbor (KNN)
- Linear Discriminant Analysis (LDA)
- Models trained on synthetic data
- Random Forest
- Linear Discriminant Analysis
- Support Vector Machine
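The post doesn't include the pre-processing code, so below is a rough sketch of the three steps listed above, assuming a data frame train_df with an outcome factor Class and otherwise numeric features (those names, the 0.90 correlation cutoff, and the choice of 20 top features are my assumptions). It uses caret::findCorrelation for the correlation filter and randomForest importance scores for feature selection; the author may have implemented these steps differently.

```r
library(caret)         # findCorrelation()
library(randomForest)  # randomForest(), importance()

features <- setdiff(names(train_df), "Class")

## 1. Correlation filter: drop one feature from each highly correlated pair.
cor_mat  <- cor(train_df[features], use = "pairwise.complete.obs")
drop     <- features[findCorrelation(cor_mat, cutoff = 0.90)]
train_df <- train_df[, setdiff(names(train_df), drop)]
features <- setdiff(names(train_df), "Class")

## 2. Mean imputation: replace each feature's NAs with that feature's mean.
for (f in features) {
  x <- train_df[[f]]
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  train_df[[f]] <- x
}

## 3. For the random forest only, keep the most important features.
rf_full <- randomForest(Class ~ ., data = train_df, importance = TRUE)
imp     <- importance(rf_full, type = 2)   # MeanDecreaseGini
top     <- rownames(imp)[order(imp[, 1], decreasing = TRUE)][1:20]
rf_data <- train_df[, c(top, "Class")]
```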
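The post also doesn't say which modeling framework was used. The sketch below assumes a caret workflow, which covers all five model types above through a single train() interface (method strings "rf", "svmRadial", "gbm", "knn", "lda"), with 5-fold cross-validation and log loss as the tuning metric; the positive class level "yes" and the data frame names are hypothetical.

```r
library(caret)

## Outcome must be a factor with valid R level names (e.g. "no"/"yes")
## for classProbs = TRUE to work.
ctrl <- trainControl(method = "cv", number = 5,
                     classProbs = TRUE, summaryFunction = mnLogLoss)

methods <- c(rf = "rf", svm = "svmRadial", bt = "gbm",
             knn = "knn", lda = "lda")

fits <- lapply(methods, function(m)
  train(Class ~ ., data = train_df, method = m,
        metric = "logLoss", maximize = FALSE, trControl = ctrl))

## Predicted probability of the positive class on the held-out test set,
## scored with the balanced_log_loss() sketched earlier in this post.
test_probs <- lapply(fits, function(fit)
  predict(fit, newdata = test_df, type = "prob")[, "yes"])
scores <- sapply(test_probs, function(p)
  balanced_log_loss(as.integer(test_df$Class == "yes"), p))
scores   # the lowest value identifies the model to submit
```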