Pytorch - how to undersample using WeightedRandomSampler


Question

I have an unbalanced dataset and would like to undersample the class that is overrepresented. How do I go about it? I would like to use WeightedRandomSampler, but I am also open to other suggestions.

So far I am assuming that my code will have to be structured kind of like the following, but I don't know how exactly to do it.

trainset = datasets.ImageFolder(path_train, transform=transform)
...
sampler = data.WeightedRandomSampler(weights=..., num_samples=..., replacement=...)
...
trainloader = data.DataLoader(trainset, batch_size=batch_size, sampler=sampler)

I hope someone can help. Thanks a lot.

Answer

From my understanding, PyTorch's WeightedRandomSampler 'weights' argument is somewhat similar to numpy.random.choice's 'p' argument, i.e. the probability that a sample will get randomly selected. PyTorch uses the weights to randomly sample training examples, and the docs state that the weights don't have to sum to 1, which is why it's not exactly like numpy's random choice. The larger the weight, the more likely that sample will be drawn.
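For instance, here is a minimal sketch (the weight values are made up for illustration) showing that PyTorch accepts unnormalized weights, while numpy.random.choice's 'p' must sum to 1:

import numpy as np
import torch
from torch.utils.data import WeightedRandomSampler

# Unnormalized weights: they sum to 6, not 1, and PyTorch is fine with that.
weights = [1.0, 1.0, 4.0]
sampler = WeightedRandomSampler(weights=weights, num_samples=6000, replacement=True)
counts = torch.bincount(torch.tensor(list(sampler)))
print(counts)  # index 2 gets drawn roughly 4x as often as index 0 or 1

# numpy.random.choice, by contrast, requires p to sum to 1:
p = np.array(weights) / np.sum(weights)
drawn = np.random.choice(len(weights), size=6000, p=p)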

When you have replacement=True, training examples can be drawn more than once, which means you can have copies of training examples in your train set that get used to train your model: oversampling. At the same time, samples whose weights are low COMPARED TO THE OTHER TRAINING SAMPLE WEIGHTS have a lower chance of being selected for random sampling: undersampling.
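To make this concrete, a tiny made-up illustration: drawing one "epoch" of indices shows the high-weight examples appearing more than once while the low-weight one usually gets skipped:

from torch.utils.data import WeightedRandomSampler

# Ten "examples": indices 0-8 get weight 1.0, index 9 gets a tiny weight.
weights = [1.0] * 9 + [0.01]
sampler = WeightedRandomSampler(weights=weights, num_samples=10, replacement=True)
print(sorted(sampler))  # e.g. [0, 1, 1, 3, 4, 5, 5, 7, 8, 8]:
                        # repeats are oversampling; the missing index 9 is undersampling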

I have no clue how the num_samples argument works when using it with the train loader, but I can warn you NOT to put your batch size there. Today I tried putting the batch size and it gave horrible results. My co-worker put the number of classes * 100 and his results were much better. All I know is that you should not put the batch size there. I also tried putting the size of all my training data for num_samples and it had better results, but took forever to train. Either way, play around with it and see what works best for you. I would guess that the safe bet is to use the number of training examples for the num_samples argument.
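For what it's worth, num_samples is simply how many indices the sampler yields per epoch, so the DataLoader's epoch length is ceil(num_samples / batch_size) batches; putting the batch size there means each epoch trains on a single batch, which would explain the horrible results. A toy sketch (random data, made-up sizes) showing this:

import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

# A toy dataset of 1000 examples, just to show the effect of num_samples.
dataset = TensorDataset(torch.randn(1000, 3), torch.randint(0, 2, (1000,)))
weights = torch.ones(1000)

for num_samples in (32, 1000):
    sampler = WeightedRandomSampler(weights, num_samples=num_samples, replacement=True)
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)
    print(num_samples, len(loader))  # num_samples=32 -> 1 batch per epoch; 1000 -> 32 batches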

Here's the example I saw somebody else use, and I use it as well for binary classification. It seems to work just fine. You take the inverse of the number of training examples for each class, and you give every training example the weight of its respective class.

A quick example using your trainset object:

import numpy as np
from torch.utils.data import DataLoader, WeightedRandomSampler

labels = np.array(trainset.samples)[:, 1]  # turn to array and take column index 1, which holds the labels
labels = labels.astype(int)                # convert the label strings to ints

majority_weight = 1 / num_of_majority_class_training_examples
minority_weight = 1 / num_of_minority_class_training_examples

# This assumes that your minority class is the integer 1 in the labels object.
# If not, switch places so it's (minority_weight, majority_weight).
sample_weights = np.array([majority_weight, minority_weight])

# This goes through each training example and uses its label (0 or 1) as the
# index into sample_weights, which is the weight you want for that class.
weights = sample_weights[labels]

# As suggested above, the number of training examples is a safe bet for num_samples.
sampler = WeightedRandomSampler(weights=weights, num_samples=len(weights), replacement=True)

trainloader = DataLoader(trainset, batch_size=batch_size, sampler=sampler)

Since the PyTorch docs say that the weights don't have to sum to 1, I think you can also just use the ratio between the imbalanced classes. For example, if you had 100 training examples of the majority class and 50 training examples of the minority class, that's a 2:1 ratio. To counterbalance this, I think you can just use a weight of 1.0 for each majority-class training example and a weight of 2.0 for each minority-class training example, because technically you want the minority class to be 2 times more likely to be selected, which would balance your classes during random selection.
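A quick sanity check of that ratio idea (toy, made-up labels): counting the labels of one epoch's draws comes out roughly balanced:

import torch
from torch.utils.data import WeightedRandomSampler

# 100 majority-class examples (label 0) and 50 minority-class examples (label 1).
labels = torch.tensor([0] * 100 + [1] * 50)
weights = torch.ones(len(labels))
weights[labels == 1] = 2.0  # minority examples twice as likely to be drawn

sampler = WeightedRandomSampler(weights, num_samples=len(labels), replacement=True)
drawn_labels = labels[torch.tensor(list(sampler))]
print(torch.bincount(drawn_labels))  # roughly tensor([75, 75]): balanced draws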

I hope this helped a little bit. Sorry for the sloppy writing; I was in a huge rush and saw that nobody had answered. I struggled through this myself without being able to find any help for it either. If it doesn't make sense, just say so and I'll re-edit it and make it clearer when I get free time.
