The ICR (InVitro Cell Research) Identifying Age-Related Conditions Kaggle competition is over.
My submission finished 4745th out of 6713 submissions, with a log loss (the metric used in the competition) of 0.88. The winning submission had a log loss of 0.30 (lower is better).
My fundamental error was optimizing for straight log loss rather than balanced log loss. I did not notice that the competition specified balanced log loss, which weights the two class outcomes (develops an age-related condition or not) equally, so each class contributes the same to the score no matter how many observations it has. The training data was very unbalanced, so this mattered. My bad.
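For reference, the balanced log loss averages the per-class mean log losses, so the rare class counts as much as the common one. A minimal R sketch of the metric (my reconstruction, not competition code):

```r
## Balanced log loss: average of the mean log loss within each class,
## so both classes contribute equally no matter how unbalanced the data is.
balanced_log_loss <- function(y_true, p_pred, eps = 1e-15) {
  p   <- pmin(pmax(p_pred, eps), 1 - eps)  # clip to avoid log(0)
  ll0 <- -mean(log(1 - p[y_true == 0]))    # mean log loss over class-0 observations
  ll1 <- -mean(log(p[y_true == 1]))        # mean log loss over class-1 observations
  (ll0 + ll1) / 2
}
```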
But I do not think I would have done much better. The provided data set was very small - only 617 observations. After setting aside a test set and a calibration set (used to calibrate predicted probabilities - see below), I had only 346 observations left for training models. I could have used a somewhat larger training set at the cost of a smaller test set, but I didn't take the time to do so.
With such a small data set, my models were probably overfitted. I tried to address that by calibrating the probabilities coming out of the model predictions and by training on synthetic data. I did not cross-validate very much, which might have helped - but only so much, given how little data was provided.
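The post doesn't show the calibration step, so here is a sketch of one common approach, Platt scaling: fit a logistic regression of the true outcome on a model's raw predicted probability using the held-out calibration set, then apply it to new predictions. The data frames calib_df and test_df, the Class outcome, and the raw_prob column are hypothetical names, not taken from the actual code.

```r
## Platt scaling: regress the (0/1) outcome on the uncalibrated probability,
## fit on the calibration set only.
platt_fit <- glm(Class ~ raw_prob, data = calib_df, family = binomial)

## Apply the fitted mapping to the raw test-set predictions.
test_df$cal_prob <- predict(platt_fit, newdata = test_df, type = "response")
```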
I did address the small data set by training models on synthetic data derived from the provided data, and the models trained on the additional synthetic data did seem to do much better. However, the library I used to generate the synthetic data (synthpop) is not available on Kaggle. My attempt to install it there failed because Kaggle did not have a supported version of a C compiler, so I was unable to use my best model (or what I thought was my best model) for my competition submission.
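For completeness, generating synthetic data with synthpop is essentially one call to syn(); the sketch below assumes a training data frame called train_df (a hypothetical name) and uses synthpop's default CART-based synthesis.

```r
library(synthpop)

## Synthesize a data set with the same structure as train_df. By default,
## syn() models each variable with CART, conditioned on previously
## synthesized variables.
syn_out      <- syn(train_df, seed = 2023)
synthetic_df <- syn_out$syn   # the synthetic data frame

## Optional sanity check: compare real vs. synthetic marginal distributions.
compare(syn_out, train_df)
```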
Basically, I pre-processed the data, trained several models on the given data and a few on the synthetic data, kept track of the results, and submitted the one with the lowest log loss on my test set to the competition. Code sketches of the main steps follow the list below.
- Pre-processing
- Identify highly correlated features and remove the redundant ones.
- Replace NAs (missing values) with the mean of that feature's remaining values (mean imputation).
- For the random forest model, train on a subset of only the most important features.
- Models trained on provided data
- Random Forest (RF)
- Support Vector Machine (SVM)
- Boosted Tree (BT)
- K Nearest Neighbor (KNN)
- Linear Discriminant Analysis (LDA)
- Models trained on synthetic data
- Random Forest
- Linear Discriminant Analysis
- Support Vector Machine
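The post doesn't include the pre-processing code, so below is a rough sketch of the three steps listed above, assuming a data frame train_df with an outcome factor Class and otherwise numeric features (those names, the 0.90 correlation cutoff, and the choice of 20 top features are my assumptions). It uses caret::findCorrelation for the correlation filter and randomForest importance scores for feature selection; the author may have implemented these steps differently.

```r
library(caret)         # findCorrelation()
library(randomForest)  # randomForest(), importance()

features <- setdiff(names(train_df), "Class")

## 1. Correlation filter: drop one feature from each highly correlated pair.
cor_mat  <- cor(train_df[features], use = "pairwise.complete.obs")
drop     <- features[findCorrelation(cor_mat, cutoff = 0.90)]
train_df <- train_df[, setdiff(names(train_df), drop)]
features <- setdiff(names(train_df), "Class")

## 2. Mean imputation: replace each feature's NAs with that feature's mean.
for (f in features) {
  x <- train_df[[f]]
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  train_df[[f]] <- x
}

## 3. For the random forest only, keep the most important features.
rf_full <- randomForest(Class ~ ., data = train_df, importance = TRUE)
imp     <- importance(rf_full, type = 2)   # MeanDecreaseGini
top     <- rownames(imp)[order(imp[, 1], decreasing = TRUE)][1:20]
rf_data <- train_df[, c(top, "Class")]
```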
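The post also doesn't say which modeling framework was used. The sketch below assumes a caret workflow, which covers all five model types above through a single train() interface (method strings "rf", "svmRadial", "gbm", "knn", "lda"), with 5-fold cross-validation and log loss as the tuning metric; the positive class level "yes" and the data frame names are hypothetical.

```r
library(caret)

## Outcome must be a factor with valid R level names (e.g. "no"/"yes")
## for classProbs = TRUE to work.
ctrl <- trainControl(method = "cv", number = 5,
                     classProbs = TRUE, summaryFunction = mnLogLoss)

methods <- c(rf = "rf", svm = "svmRadial", bt = "gbm",
             knn = "knn", lda = "lda")

fits <- lapply(methods, function(m)
  train(Class ~ ., data = train_df, method = m,
        metric = "logLoss", maximize = FALSE, trControl = ctrl))

## Predicted probability of the positive class on the held-out test set,
## scored with the balanced_log_loss() sketched earlier in this post.
test_probs <- lapply(fits, function(fit)
  predict(fit, newdata = test_df, type = "prob")[, "yes"])
scores <- sapply(test_probs, function(p)
  balanced_log_loss(as.integer(test_df$Class == "yes"), p))
scores   # the lowest value identifies the model to submit
```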