Scikit-learn balanced subsampling
Question
I'm trying to create N balanced random subsamples of my large unbalanced dataset. Is there a simple way to do this with scikit-learn / pandas, or do I have to implement it myself? Any pointers to code that does this?
These subsamples should be random and can overlap, as I feed each one to a separate classifier in a very large ensemble of classifiers.
In Weka there is a tool called spreadsubsample; is there an equivalent in sklearn? http://wiki.pentaho.com/display/DATAMINING/SpreadSubsample
(I know about weighting, but that's not what I'm looking for.)
Recommended answer
Here is my first version, which seems to be working fine. Feel free to copy it or suggest how it could be made more efficient (I have quite a lot of programming experience in general, but not that much with Python or NumPy).
This function creates a single random balanced subsample.
Edit: with subsample_size below 1, the minority classes are now sampled down as well; this should probably be changed.
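import numpy as np

def balanced_subsample(x, y, subsample_size=1.0):
    # Group the rows of x by class label and track the smallest class size.
    class_xs = []
    min_elems = None
    for yi in np.unique(y):
        elems = x[(y == yi)]  # boolean indexing returns a copy of the rows
        class_xs.append((yi, elems))
        if min_elems is None or elems.shape[0] < min_elems:
            min_elems = elems.shape[0]

    # Every class contributes the same number of rows.
    use_elems = min_elems
    if subsample_size < 1:
        use_elems = int(min_elems * subsample_size)

    xs = []
    ys = []
    for ci, this_xs in class_xs:
        # Shuffle only when rows will be dropped; this_xs is a copy,
        # so the caller's x is left untouched.
        if len(this_xs) > use_elems:
            np.random.shuffle(this_xs)
        x_ = this_xs[:use_elems]
        # np.full keeps the label dtype (np.empty would force floats).
        y_ = np.full(use_elems, ci)
        xs.append(x_)
        ys.append(y_)

    xs = np.concatenate(xs)
    ys = np.concatenate(ys)
    return xs, ys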
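Since the question asks for N subsamples rather than one, here is a minimal sketch of how the same idea might be repeated for an ensemble. It is not the code above: the names balanced_indices and n_balanced_subsamples, the fraction and seed parameters, and the toy data are all assumptions made for illustration, and it draws row indices with NumPy's Generator API instead of shuffling in place.

```python
import numpy as np

def balanced_indices(y, rng, fraction=1.0):
    """Return shuffled row indices holding the same number of rows per class."""
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    # Size every class down to (a fraction of) the smallest class.
    per_class = max(1, int(counts.min() * fraction))
    picked = [
        rng.choice(np.flatnonzero(y == c), size=per_class, replace=False)
        for c in classes
    ]
    return rng.permutation(np.concatenate(picked))

def n_balanced_subsamples(x, y, n_subsamples, fraction=1.0, seed=0):
    """Yield n_subsamples independent (and possibly overlapping) balanced draws."""
    rng = np.random.default_rng(seed)
    x, y = np.asarray(x), np.asarray(y)
    for _ in range(n_subsamples):
        idx = balanced_indices(y, rng, fraction)
        yield x[idx], y[idx]

# Hypothetical toy data: 15 rows of class 0 and 5 rows of class 1.
x = np.arange(40).reshape(20, 2)
y = np.array([0] * 15 + [1] * 5)
subsamples = list(n_balanced_subsamples(x, y, n_subsamples=3))
```

Each element of subsamples is an (xs, ys) pair that could be passed to one classifier of the ensemble. Because every draw is independent, rows can appear in several subsamples, which matches the "can overlap" requirement in the question.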
For anyone trying to make the above work with a Pandas DataFrame, you need to make a couple of changes:
- Replace the np.random.shuffle line with
this_xs = this_xs.reindex(np.random.permutation(this_xs.index))
- Replace the np.concatenate lines with
xs = pd.concat(xs)
ys = pd.Series(data=np.concatenate(ys),name='target')
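If the features and the label live in the same DataFrame, a shorter route may be possible. The sketch below is an alternative to the changes listed above, not part of the original answer: it assumes pandas 1.1 or newer (where grouped objects have a sample method), and the function name balanced_subsample_df, the column names, and the toy data are hypothetical.

```python
import pandas as pd

def balanced_subsample_df(df, target_col, random_state=None):
    """Sample every class down to the size of the smallest class."""
    smallest = int(df[target_col].value_counts().min())
    # Draws `smallest` rows without replacement from each class group.
    return df.groupby(target_col).sample(n=smallest, random_state=random_state)

# Hypothetical toy frame: 9 rows of class "a" and 3 rows of class "b".
df = pd.DataFrame({"feature": range(12), "target": ["a"] * 9 + ["b"] * 3})
balanced = balanced_subsample_df(df, "target", random_state=0)
```

Calling the function repeatedly with different random_state values would give several balanced subsamples that may overlap.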