#rstats #pydata #dataNerdFriendsOfAllFlavors
I'm building a binary classification model, and the data is really unbalanced (less than 2% of the data is in the class I want to predict)
tell me all your tips and tricks for dealing with data like this! What's your favorite model type to use in this case? Do you resample the data to change the balance?
here's a thought that someone has probably had before and I definitely don't want to implement myself: can we handle the imbalance in an ensemble method by sampling most of the trues, but a random set of the falses? so that the whole ensemble does use the full set of falses, but each tree has closer to balanced data?
I am sure this is super naive in some way, but I'd love to hear what folks think, and, if this is a thing already, what it's called
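This idea does exist off the shelf: it's essentially what imbalanced-learn calls balanced bagging. A minimal sketch under that assumption, with toy data standing in for the real set:

```python
# Sketch of the idea above: each tree sees (roughly) all the trues plus a
# random, equal-sized subset of the falses; across many trees the ensemble
# still covers the full set of falses.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from imblearn.ensemble import BalancedBaggingClassifier

X, y = make_classification(n_samples=10_000, weights=[0.98], random_state=0)

clf = BalancedBaggingClassifier(
    estimator=DecisionTreeClassifier(),  # base_estimator in older imblearn
    n_estimators=100,           # more trees -> more of the falses get used
    sampling_strategy="auto",   # undersample falses to match the trues
    replacement=False,          # each false drawn at most once per tree
    random_state=0,
)
clf.fit(X, y)
```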
@ansate This may be a good starting point: https://stats.stackexchange.com/questions/235808/binary-classification-with-strongly-unbalanced-classes
@jospueyo thanks!
@ansate I do remember having to rebalance, i.e. use all of the 2% and randomly select 1 in 50 of the 98%. Don't remember what I did next though...
@ansate Don't SMOTE.
If you are using stochastic gradient descent you could try down-sampling the majority class. But for a binary classifier, trees are very good if you can cajole the data into a shape they can take as input.
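For concreteness, a minimal down-sampling sketch along those lines, with synthetic data and an SGD-trained logistic model as stand-ins:

```python
# Down-sample the majority class to a 1:1 ratio, then fit with SGD.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=10_000, weights=[0.98], random_state=0)

rng = np.random.default_rng(0)
pos = np.flatnonzero(y == 1)
neg = np.flatnonzero(y == 0)
idx = np.concatenate([pos, rng.choice(neg, size=len(pos), replace=False)])

clf = SGDClassifier(loss="log_loss")  # "log" in older scikit-learn
clf.fit(X[idx], y[idx])
```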
@drgroftehauge interesting, my work colleague suggested SMOTE - I've never used it.
Interested in any related thoughts you wish to share.
@ansate SMOTE oversamples the minority class and adds noise (synthetic points interpolated between real ones). But that noise is disconnected from the modelling. You should be able to achieve the same thing by regularising or constraining your model, in a more intuitive way.
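One way to read that advice: put the reweighting in the loss rather than in the data. A minimal sketch, with class weights plus explicit regularisation as the "constraint":

```python
# Instead of synthesising noisy minority points, upweight the minority class
# in the loss and lean on the model's own regularisation.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=10_000, weights=[0.98], random_state=0)

clf = LogisticRegression(
    class_weight="balanced",  # the rare class counts ~50x more in the loss
    C=1.0,                    # regularisation strength (smaller C = stronger)
)
clf.fit(X, y)
```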
@ansate My two dimes: it's important to use metrics that are appropriate to the problem, both in training and in evaluation. You do this by assigning utilities that make sense for your problem to the various classification/misclassification outcomes (a worked sketch follows the paper links below). Then balance may not even be a problem.
Old and new papers on this problem, which approach it in a principled way:
- Drummond, Holte "Severe Class Imbalance: Why Better Algorithms Aren't the Answer" (https://doi.org/10.1007/11564096_52 and https://webdocs.cs.ualberta.ca/~holte/Publications)
- "Does the evaluation stand up to evaluation? A first-principle approach to the evaluation of classifiers" (https://doi.org/10.31219/osf.io/7rz8t)
@pglpm thank you! i love the paper links, very much appreciated!
@ansate whatever approach you take remember the following (Kuhn and Johnson, 2013)
@ansate
Make sure you recalibrate your predictions if you subsample (upsample or downsample); one closed-form correction is sketched after this list. https://twitter.com/MaartenvSmeden/status/1495668297630633985
If you downsample, standard Platt scaling recalibration may not work, https://arxiv.org/abs/2410.18144
Make sure subsampling happens in the training folds. If you're working within tidymodels or sklearn pipelines, they should take care of this. Just follow the documentation.
Don't use accuracy as a metric. F-Scores, MCC, PR-AUC, Cohen’s Kappa are better for imbalanced data.
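A minimal sketch of the prior-correction step for the down-sampling case: if you kept all positives and a fraction beta of the negatives, the scores of the model trained on the subsample can be mapped back to the original base rate like this.

```python
# Map scores from a model trained on down-sampled data back to the original
# base rate. beta = fraction of negatives kept (e.g. 1 in 50 -> 0.02).
def correct_undersampled_proba(p_s, beta):
    """p_s: P(y=1) predicted by the down-sampled model; returns corrected P(y=1)."""
    return beta * p_s / (beta * p_s - p_s + 1.0)

print(correct_undersampled_proba(0.5, 0.02))  # ~0.02: a coin-flip call on the
                                              # balanced sample is ~2% in the wild
```

On the folds point: imbalanced-learn's imblearn.pipeline.Pipeline applies samplers only at fit time, so cross-validation resamples the training folds alone (themis recipe steps behave the same way in tidymodels).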
@ansate
Some packages:
themis, https://themis.tidymodels.org/
ebmc, https://cran.r-project.org/web/packages/ebmc/index.html
imbalanced-learn, https://imbalanced-learn.org/stable/
@ansate I think that choosing an adequate loss function can help. You can try a soft F-score loss or the focal loss.
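For reference, a minimal numpy sketch of the binary focal loss (Lin et al. 2017); it down-weights easy examples so the rare class contributes more to the gradient:

```python
import numpy as np

def focal_loss(y_true, p, gamma=2.0, alpha=0.25):
    """Binary focal loss. y_true: 0/1 labels; p: predicted P(y=1)."""
    p = np.clip(p, 1e-7, 1 - 1e-7)
    p_t = np.where(y_true == 1, p, 1 - p)             # prob. of the true class
    alpha_t = np.where(y_true == 1, alpha, 1 - alpha)
    return np.mean(-alpha_t * (1.0 - p_t) ** gamma * np.log(p_t))
```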
@ansate I've definitely done this; I don't remember the consequences (in terms of adjusting the predictions)
@ansate I think this is called random undersampling (its complement being random oversampling, where you bootstrap-sample the trues to get "more" of them, i.e. include some of them more than once); both are sketched in code after this post.
However, there's a paper suggesting attempting to correct class imbalance may not be a good idea: https://arxiv.org/abs/2202.09101
I think the crux of the paper is that imbalance per se isn't necessarily the problem, lack of information about the rare outcome is.
So if you have a small absolute number of trues, no amount of correction can give you (or your model) more information. But if you've got a decent sample of trues, imbalance isn't necessarily a problem.
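For reference, both techniques named above are one-liners in imbalanced-learn (toy data as a stand-in):

```python
from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler

X, y = make_classification(n_samples=10_000, weights=[0.98], random_state=0)

X_u, y_u = RandomUnderSampler(random_state=0).fit_resample(X, y)  # drop falses
X_o, y_o = RandomOverSampler(random_state=0).fit_resample(X, y)   # bootstrap trues
```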
@theta_max thank you! this is super helpful. I love that we can have these conversations here. :D
@ansate me too! I signed up for a "mathstodon" account but spend most of my time posting about running, d'oh
@ansate xgboost and lightgbm have a hyperparameter, scale_pos_weight, that would essentially do what I think you're describing, but as with up/downsampling, you'd need to recalibrate in order to make your probabilities interpretable.
https://xgboost.readthedocs.io/en/stable/tutorials/param_tuning.html#handle-imbalanced-dataset
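A minimal sketch of that route; the linked docs' rule of thumb is sum(negatives) / sum(positives), which at a ~2% positive rate comes out around 49:

```python
from sklearn.datasets import make_classification
import xgboost as xgb

X, y = make_classification(n_samples=10_000, weights=[0.98], random_state=0)

spw = (y == 0).sum() / (y == 1).sum()         # ~49 at a 2% positive rate
clf = xgb.XGBClassifier(scale_pos_weight=spw)
clf.fit(X, y)
# As noted above: the resulting probabilities are shifted and need
# recalibration before being read as real event rates.
```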