Earning Data Science Gold - Kaggle’s “ICR - Identifying Age-Related Conditions” Competition
This past week, the Kaggle Competition “ICR - Identifying Age-Related Conditions” finished up its 3-month run to improve classification of age-related conditions in anonymized patient data. I joined the competition with 8 days remaining, made 6 submissions (max 1 per day) and finished in 8th place among 6,712 teams composed of researchers, PhDs, data science practitioners, and enthusiasts of all types. Gold Medals were awarded to the top 22 teams overall.
If you’re unfamiliar with Kaggle, I suggest you check it out. I’ve been a fan of Kaggle for years, though I’ve had limited time to spend on entries and competitions. There are introductory courses to data science and machine learning that can greatly enhance your ability to derive insights from data. You’ll find hundreds of data sets for research and exploration. Kaggle also hosts a range of competitions, some for learning, some for swag, and some for cash prizes (such as this one), sponsored by companies and research institutions. Should you find a competition that piques your interest but has already ended, you can still enter, make submissions, and check your work against the competitors. Bonus - it’s fun!
For this challenge, the community was tasked with classifying whether particular patients were presenting with an age-related condition based on 50+ health characteristic parameters that had been completely anonymized. We had no indication what each parameter meant throughout the event, working purely against the values in the data set and a small set of supplementary data referred to by Greek letters.
My Solution in Brief
I focused first on predicting whether someone had a specific age-related condition rather than only the binary class. This proved more effective than predicting the class alone. I created an ensemble of ensemble predictors, each focused on a specific condition. The models in the primary ensembles were XGBoost & TabPFN.
Imputing Data:
Rather than imputing with medians or fixed values, or dropping empty data (either rows or columns), I employed XGBoost to learn & impute the missing values in each field. This produced a much better result, averaged across the public and private scores. Note: Public scoring is available throughout the competition while the private score is only shared at the end of the competition. Private scoring is used for final placements, medals, and cash prizes.
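To make the idea of model-based imputation concrete, here is a minimal sketch, not the competition code. For each column with gaps, it trains a model on the rows where that column is present and predicts it for the rows where it is missing. The function name impute_with_model, the make_model parameter, and the tiny NearestNeighborRegressor stand-in are all hypothetical and exist only so the example runs without extra dependencies. Patching gaps in the feature columns with medians is also an assumption of this sketch.

```python
from statistics import median


class NearestNeighborRegressor:
    """Hypothetical stand-in model with a fit/predict interface."""

    def fit(self, X, y):
        self.X, self.y = X, y
        return self

    def predict(self, X):
        preds = []
        for row in X:
            # Pick the training row with the smallest squared distance.
            best = min(
                range(len(self.X)),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(self.X[i], row)),
            )
            preds.append(self.y[best])
        return preds


def impute_with_model(rows, make_model=NearestNeighborRegressor):
    """Fill None entries column by column, predicting each from the other columns."""
    n_cols = len(rows[0])

    # Per-column medians, used only to patch gaps in the *feature* columns
    # (an assumption of this sketch, not something the write-up specifies).
    medians = []
    for c in range(n_cols):
        observed = [r[c] for r in rows if r[c] is not None]
        medians.append(median(observed) if observed else 0.0)

    filled = [list(r) for r in rows]
    for target in range(n_cols):
        missing = [i for i, r in enumerate(rows) if r[target] is None]
        known = [i for i, r in enumerate(rows) if r[target] is not None]
        if not missing or not known:
            continue  # nothing to impute, or nothing to learn from

        def features(i):
            return [
                rows[i][c] if rows[i][c] is not None else medians[c]
                for c in range(n_cols)
                if c != target
            ]

        model = make_model().fit(
            [features(i) for i in known], [rows[i][target] for i in known]
        )
        predictions = model.predict([features(i) for i in missing])
        for i, value in zip(missing, predictions):
            filled[i][target] = value
    return filled
```

A gradient-boosted regressor with the same fit/predict shape, such as the XGBoost models mentioned above, could be passed as make_model in place of the stand-in.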
Predicting the Class.
Within the Greeks Data Set, we were provided with 3 distinct conditions that an individual might present with. Rather than simply predicting if someone had any condition at all up front, I decided to build predictors for each condition individually and combine these with an overall model. My hypothesis was that by combining these predictors, I might have an overall improved result.
Effectively, I took the maximum of the positive predictions across the individual conditions and averaged that with the prediction from the general class predictor.
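As a small illustration of that combination step, here is a sketch under stated assumptions. It assumes each condition predictor returns a probability that the patient has that condition, and that the average is equally weighted. The function names are hypothetical.

```python
def combine_predictions(condition_probs, general_prob):
    """Average the strongest per-condition probability with the general class probability."""
    return 0.5 * (max(condition_probs) + general_prob)


def combine_batch(condition_probs_per_patient, general_probs):
    """Apply the same combination to every patient in a batch."""
    return [
        combine_predictions(conditions, general)
        for conditions, general in zip(condition_probs_per_patient, general_probs)
    ]
```

For example, a patient with per-condition probabilities of 0.1, 0.7, and 0.2 and a general class probability of 0.5 would receive a combined score of about 0.6.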
My Solution Write-up on Kaggle’s Forums
You can find more details of the solution and a more extensive write-up here: https://www.kaggle.com/competitions/icr-identify-age-related-conditions/discussion/430897
Data Science and ML w/o the Credentials.
You don’t need a degree in data science, nor a long history working in ML or AI, to make use of the tools or to gain valuable expertise. Data Analysis is an essential skill, and great product managers should be capable of deriving insights alongside their partners in Data Science (when you’re lucky enough to have them) and on their own (when you’re not).