#rstats #pydata #dataNerdFriendsOfAllFlavors
I'm building a binary classification model, and the data is really unbalanced (less than 2% of the data is in the class I want to predict)
tell me all your tips and tricks for dealing with data like this! What's your favorite model type to use in this case? Do you resample the data to change the balance?
here's a thought that someone has probably had before and I definitely don't want to implement myself: can we handle the imbalance in an ensemble method by sampling most of the trues, but a random set of the falses? so that the whole ensemble does use the full set of falses, but each tree has closer to balanced data?
I am sure this is super naive in some way, but I'd love to hear what folks think, and, if this is a thing already, what it's called
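This idea does exist off the shelf: it's essentially what imbalanced-learn calls balanced bagging. A minimal sketch under that assumption, with toy data standing in for the real set:

```python
# Sketch of the idea above: each tree sees (roughly) all the trues plus a
# random, equal-sized subset of the falses; across many trees the ensemble
# still covers the full set of falses.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from imblearn.ensemble import BalancedBaggingClassifier

X, y = make_classification(n_samples=10_000, weights=[0.98], random_state=0)

clf = BalancedBaggingClassifier(
    estimator=DecisionTreeClassifier(),  # base_estimator in older imblearn
    n_estimators=100,           # more trees -> more of the falses get used
    sampling_strategy="auto",   # undersample falses to match the trues
    replacement=False,          # each false drawn at most once per tree
    random_state=0,
)
clf.fit(X, y)
```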
@ansate This may be a good starting point: https://stats.stackexchange.com/questions/235808/binary-classification-with-strongly-unbalanced-classes
@jospueyo thanks!
@ansate I do remember having to rebalance, i.e. use all of the 2% and randomly select 1 in 50 of the 98%. Don't remember what I did next though...
@ansate Don't SMOTE.
If you are using stochastic gradient descent you could try down-sampling the majority class. But for a binary classifier, trees are very good if you can cajole the data into a shape they can take as input.
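For concreteness, a minimal down-sampling sketch along those lines, with synthetic data and an SGD-trained logistic model as stand-ins:

```python
# Down-sample the majority class to a 1:1 ratio, then fit with SGD.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=10_000, weights=[0.98], random_state=0)

rng = np.random.default_rng(0)
pos = np.flatnonzero(y == 1)
neg = np.flatnonzero(y == 0)
idx = np.concatenate([pos, rng.choice(neg, size=len(pos), replace=False)])

clf = SGDClassifier(loss="log_loss")  # "log" in older scikit-learn
clf.fit(X[idx], y[idx])
```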
@drgroftehauge interesting, my work colleague suggested SMOTE - I've never used it.
Interested in any related thoughts you wish to share.
@ansate SMOTE oversamples the minority class and adds noise (synthetic points interpolated between real ones). But that noise is disconnected from the modelling. You should be able to achieve the same thing by regularising or constraining your model, in a more intuitive way.
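One way to read that advice: put the reweighting in the loss rather than in the data. A minimal sketch, with class weights plus explicit regularisation as the "constraint":

```python
# Instead of synthesising noisy minority points, upweight the minority class
# in the loss and lean on the model's own regularisation.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=10_000, weights=[0.98], random_state=0)

clf = LogisticRegression(
    class_weight="balanced",  # the rare class counts ~50x more in the loss
    C=1.0,                    # regularisation strength (smaller C = stronger)
)
clf.fit(X, y)
```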
@ansate My two dimes: it's important to use metrics that are appropriate to the problem, both in training and in evaluation. You do this by assigning utilities that make sense for your problem to the various classification/misclassification outcomes (a worked sketch follows the paper links below). Then balance may not even be a problem.
Old and new papers on this problem, which approach it in a principled way:
- Drummond, Holte "Severe Class Imbalance: Why Better Algorithms Aren't the Answer" (https://doi.org/10.1007/11564096_52 and https://webdocs.cs.ualberta.ca/~holte/Publications)
- "Does the evaluation stand up to evaluation? A first-principle approach to the evaluation of classifiers" (https://doi.org/10.31219/osf.io/7rz8t)
@pglpm thank you! i love the paper links, very much appreciated!
@ansate whatever approach you take remember the following (Kuhn and Johnson, 2013)
@ansate
Make sure you recalibrate your predictions if you subsample (upsample or downsample); one closed-form correction is sketched after this list. https://twitter.com/MaartenvSmeden/status/1495668297630633985
If you downsample, standard Platt scaling recalibration may not work, https://arxiv.org/abs/2410.18144
Make sure subsampling happens in the training folds. If you're working within tidymodels or sklearn pipelines, they should take care of this. Just follow the documentation.
Don't use accuracy as a metric. F-Scores, MCC, PR-AUC, Cohen’s Kappa are better for imbalanced data.
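A minimal sketch of the prior-correction step for the down-sampling case: if you kept all positives and a fraction beta of the negatives, the scores of the model trained on the subsample can be mapped back to the original base rate like this.

```python
# Map scores from a model trained on down-sampled data back to the original
# base rate. beta = fraction of negatives kept (e.g. 1 in 50 -> 0.02).
def correct_undersampled_proba(p_s, beta):
    """p_s: P(y=1) predicted by the down-sampled model; returns corrected P(y=1)."""
    return beta * p_s / (beta * p_s - p_s + 1.0)

print(correct_undersampled_proba(0.5, 0.02))  # ~0.02: a coin-flip call on the
                                              # balanced sample is ~2% in the wild
```

On the folds point: imbalanced-learn's imblearn.pipeline.Pipeline applies samplers only at fit time, so cross-validation resamples the training folds alone (themis recipe steps behave the same way in tidymodels).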
@ansate
Some packages:
themis, https://themis.tidymodels.org/
ebmc, https://cran.r-project.org/web/packages/ebmc/index.html
imbalanced-learn, https://imbalanced-learn.org/stable/
@ansate I think that choosing an adequate loss function can help. You can try a soft F-score loss or the focal loss.
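For reference, a minimal numpy sketch of the binary focal loss (Lin et al. 2017); it down-weights easy examples so the rare class contributes more to the gradient:

```python
import numpy as np

def focal_loss(y_true, p, gamma=2.0, alpha=0.25):
    """Binary focal loss. y_true: 0/1 labels; p: predicted P(y=1)."""
    p = np.clip(p, 1e-7, 1 - 1e-7)
    p_t = np.where(y_true == 1, p, 1 - p)             # prob. of the true class
    alpha_t = np.where(y_true == 1, alpha, 1 - alpha)
    return np.mean(-alpha_t * (1.0 - p_t) ** gamma * np.log(p_t))
```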
@ansate I've definitely done this; I don't remember the consequences (in terms of adjusting the predictions)
@ansate I think this is called random undersampling (its complement being random oversampling, where you bootstrap-sample the trues to get "more" of them, i.e. include some of them more than once); both are sketched in code after this post.
However, there's a paper suggesting attempting to correct class imbalance may not be a good idea: https://arxiv.org/abs/2202.09101
I think the crux of the paper is that imbalance per se isn't necessarily the problem, lack of information about the rare outcome is.
So if you have a small absolute number of trues, no amount of correction can give you (or your model) more information. But if you've got a decent sample of trues, imbalance isn't necessarily a problem.
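For reference, both techniques named above are one-liners in imbalanced-learn (toy data as a stand-in):

```python
from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler

X, y = make_classification(n_samples=10_000, weights=[0.98], random_state=0)

X_u, y_u = RandomUnderSampler(random_state=0).fit_resample(X, y)  # drop falses
X_o, y_o = RandomOverSampler(random_state=0).fit_resample(X, y)   # bootstrap trues
```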
@theta_max thank you! this is super helpful. I love that we can have these conversations here. :D
@ansate me too! I signed up for a "mathstodon" account but spend most of my time posting about running, d'oh
@ansate xgboost and lightgbm have a hyperparameter, scale_pos_weight, that would essentially do what I think you're describing, but as with up/downsampling, you'd need to recalibrate in order to make your probabilities interpretable.
https://xgboost.readthedocs.io/en/stable/tutorials/param_tuning.html#handle-imbalanced-dataset
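A minimal sketch of that route; the linked docs' rule of thumb is sum(negatives) / sum(positives), which at a ~2% positive rate comes out around 49:

```python
from sklearn.datasets import make_classification
import xgboost as xgb

X, y = make_classification(n_samples=10_000, weights=[0.98], random_state=0)

spw = (y == 0).sum() / (y == 1).sum()         # ~49 at a 2% positive rate
clf = xgb.XGBClassifier(scale_pos_weight=spw)
clf.fit(X, y)
# As noted above: the resulting probabilities are shifted and need
# recalibration before being read as real event rates.
```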