
Melissa Santos (@ansate):

I'm building a binary classification model, and the data is really unbalanced (less than 2% of the data is in the class I want to predict)

Tell me all your tips and tricks for dealing with data like this! What's your favorite model type to use in this case? Do you resample the data to change the balance?

Here's a thought that someone has probably had before and I definitely don't want to implement myself: can we handle the imbalance in an ensemble method by giving each tree all of the trues but only a random subset of the falses? That way the whole ensemble still uses the full set of falses, but each individual tree sees closer-to-balanced data.

I am sure this is super naive in some way, but I'd love to hear what folks think and, if this is a thing already, what it's called.
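A minimal sketch of that idea: each tree gets all of the positives plus a fresh random draw of negatives, and the ensemble averages the per-tree probabilities. The tree depth, tree count, and seed are illustrative choices, not recommendations.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_balanced_ensemble(X, y, n_trees=50, seed=0):
    """Fit one tree per random, roughly balanced subsample of (X, y)."""
    rng = np.random.default_rng(seed)
    pos = np.flatnonzero(y == 1)
    neg = np.flatnonzero(y == 0)
    trees = []
    for _ in range(n_trees):
        # all positives + an equally sized random draw of negatives
        idx = np.concatenate([pos, rng.choice(neg, size=len(pos), replace=False)])
        trees.append(DecisionTreeClassifier(max_depth=8).fit(X[idx], y[idx]))
    return trees

def ensemble_proba(trees, X):
    # Average per-tree P(y=1). NOTE: each tree saw ~50/50 data, so this
    # is not calibrated to the true base rate -- recalibrate before use.
    return np.mean([t.predict_proba(X)[:, 1] for t in trees], axis=0)
```

This is essentially what imbalanced-learn's BalancedBaggingClassifier and BalancedRandomForestClassifier implement out of the box, and, as noted downthread, because each tree trains on balanced data the averaged probabilities won't reflect the true 2% base rate without recalibration.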

@ansate I do remember having to rebalance, i.e. use all of the 2% and randomly select 1 in 50 of the 98%. Don't remember what I did next, though...

@ansate Don't SMOTE.
If you are using stochastic gradient descent, you could try down-sampling the majority class. But for a binary classifier, trees are very good if you can cajole the data into a shape they can take as input.

@drgroftehauge Interesting, my work colleague suggested SMOTE - I've never used it.

Interested in any related thoughts you wish to share.

@ansate SMOTE oversamples the minority class and adds noise, but the noise is disconnected from the modelling. You can achieve the same thing by regularising or constraining your model, in a more intuitive way.
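One concrete reading of that advice, as a sketch rather than the poster's prescription: reweight the loss by class frequency and regularize, instead of synthesizing minority points. The class_weight and C values below are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Instead of SMOTE-style synthetic points: reweight the loss and regularize.
# class_weight='balanced' scales each class inversely to its frequency;
# C controls the strength of the L2 (ridge) penalty.
X, y = make_classification(n_samples=10_000, weights=[0.98], random_state=0)
clf = LogisticRegression(class_weight="balanced", C=0.1, max_iter=1000)
clf.fit(X, y)
```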

@ansate My two dimes: it's important to use metrics that are appropriate to the problem, both in training and in evaluation. You do this by assigning utilities that make sense for your problem to the various classification/misclassification possibilities. Then imbalance may not even be a problem (see the sketch after the references below).

Old and new papers on this problem, which approach it in a principled way:

- Drummond & Holte, "Severe Class Imbalance: Why Better Algorithms Aren't the Answer" (doi.org/10.1007/11564096_52; also at webdocs.cs.ualberta.ca/~holte/)

- "Does the evaluation stand up to evaluation? A first-principle approach to the evaluation of classifiers" (doi.org/10.31219/osf.io/7rz8t)

@pglpm thank you! I love the paper links, very much appreciated!

@ansate Whatever approach you take, remember the following (Kuhn and Johnson, 2013)

@ansate
- Make sure you recalibrate your predictions if you subsample (upsample or downsample): twitter.com/MaartenvSmeden/sta
- If you downsample, standard Platt-scaling recalibration may not work: arxiv.org/abs/2410.18144
- Make sure subsampling happens only inside the training folds. If you're working within tidymodels or sklearn pipelines, they should take care of this; just follow the documentation.
- Don't use accuracy as a metric. F-scores, MCC, PR-AUC, and Cohen's kappa are better for imbalanced data.

(A sketch of the in-fold subsampling and recalibration points follows the link preview below.)

Link preview - Maarten van Smeden (@MaartenvSmeden) on X: "NEW PREPRINT The increasingly popular class imbalance approaches (such as SMOTE) for risk prediction modeling: they are likely to do more harm than good" https://t.co/CJiH4bhloL
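A sketch of those two points, assuming the imbalanced-learn package: its Pipeline applies samplers only during fit, so cross-validation resamples inside each training fold only; and if you keep all positives but only a fraction beta of negatives, there is a standard analytic odds correction you can try before full recalibration. The data, model, and sampler settings below are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from imblearn.pipeline import Pipeline
from imblearn.under_sampling import RandomUnderSampler

X, y = make_classification(n_samples=20_000, weights=[0.98], random_state=0)

# The sampler runs only when the pipeline is fit, i.e. only on the
# training portion of each CV fold; scoring sees untouched data.
pipe = Pipeline([
    ("under", RandomUnderSampler(random_state=0)),
    ("clf", HistGradientBoostingClassifier()),
])
print(cross_val_score(pipe, X, y, cv=5, scoring="average_precision").mean())

def correct_undersampled_proba(p, beta):
    """Undo the base-rate shift from keeping all positives and a fraction
    `beta` of negatives: model odds are inflated by 1/beta, so shrink them."""
    return beta * p / (beta * p - p + 1.0)
```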

@ansate I think that choosing an adequate loss function can help. You can try F-scores and the focal loss.
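For reference, a sketch of the binary focal loss (Lin et al., 2017), which down-weights well-classified examples so the rare class contributes more to the gradient; the alpha and gamma values below are common defaults, not tuned choices.

```python
import numpy as np

def focal_loss(y_true, p_pred, alpha=0.25, gamma=2.0, eps=1e-7):
    """Binary focal loss: the (1 - p_t)^gamma factor shrinks the loss on
    easy examples; alpha rebalances the two classes."""
    p = np.clip(p_pred, eps, 1.0 - eps)
    pos = -alpha * (1.0 - p) ** gamma * np.log(p)        # terms where y = 1
    neg = -(1.0 - alpha) * p ** gamma * np.log(1.0 - p)  # terms where y = 0
    return np.mean(np.where(y_true == 1, pos, neg))
```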

@ansate I've definitely done this; I don't remember the consequences (in terms of adjusting the predictions).

@ansate I think this is called random undersampling (with the complement being random oversampling, where you bootstrap-sample the trues to get "more" of them, i.e. include some of them more than once).

However, there's a paper suggesting attempting to correct class imbalance may not be a good idea: arxiv.org/abs/2202.09101

I think the crux of the paper is that imbalance per se isn't necessarily the problem; lack of information about the rare outcome is.

So if you have a small absolute number of trues, no amount of correction can give you (or your model) more information. But if you've got a decent sample of trues, imbalance isn't necessarily a problem.

Link preview (arXiv.org): "The harm of class imbalance corrections for risk prediction models: illustration and simulation using logistic regression"
Abstract: Methods to correct class imbalance, i.e. imbalance between the frequency of outcome events and non-events, are receiving increasing interest for developing prediction models. We examined the effect of imbalance correction on the performance of standard and penalized (ridge) logistic regression models in terms of discrimination, calibration, and classification. We examined random undersampling, random oversampling and SMOTE using Monte Carlo simulations and a case study on ovarian cancer diagnosis. The results indicated that all imbalance correction methods led to poor calibration (strong overestimation of the probability to belong to the minority class), but not to better discrimination in terms of the area under the receiver operating characteristic curve. Imbalance correction improved classification in terms of sensitivity and specificity, but similar results were obtained by shifting the probability threshold instead. Our study shows that outcome imbalance is not a problem in itself, and that imbalance correction may even worsen model performance.

@theta_max thank you! this is super helpful. I love that we can have these conversations here. :D

@ansate me too! I signed up for a "mathstodon" account but spend most of my time posting about running, d'oh

@ansate xgboost and lightgbm have a hyperparameter, scale_pos_weight, that would essentially do what I think you're describing, but as with up/downsampling, you'd need to recalibrate in order to make your probabilities interpretable.
xgboost.readthedocs.io/en/stab

Link preview (xgboost.readthedocs.io): Notes on Parameter Tuning — xgboost 2.1.1 documentation
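A sketch of that knob on synthetic data; the hyperparameters are illustrative, and the n_neg / n_pos ratio is the heuristic the xgboost docs suggest.

```python
import xgboost as xgb
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=10_000, weights=[0.98], random_state=0)
n_pos, n_neg = (y == 1).sum(), (y == 0).sum()

model = xgb.XGBClassifier(
    n_estimators=200,
    scale_pos_weight=n_neg / n_pos,  # ~49 at a 2% positive rate
    eval_metric="aucpr",             # PR-AUC, per the metric advice upthread
)
model.fit(X, y)
# As the post says: the weighting shifts predicted probabilities, so
# recalibrate (e.g. Platt/isotonic on held-out data) before interpreting them.
```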