How to qcut with non-unique bin edges?

Question

My question is the same as this previous one:

Binning with zero values in pandas

However, I still want to include the 0 values in a fractile. Is there a way to do this? In other words, if I have 600 values, 50% of which are 0 and the rest of which are, say, between 1 and 100, how would I put all the 0 values in fractile 1 and the non-zero values in fractiles 2 to 10 (assuming I want 10 fractiles)? Could I convert the 0s to NaN, qcut the remaining non-NaN data into 9 fractiles (1 to 9), add 1 to each label (now 2 to 10), and then label all the 0 values as fractile 1 manually? Even this is tricky, because in addition to the 600 values my data set has another couple hundred that may already be NaN before I convert the 0s to NaN.
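The approach described in the question can be sketched without converting the zeros to NaN at all, which avoids confusing them with values that were already missing. This is only a minimal sketch: the function name `zero_first_fractiles` and the sample data are hypothetical, and it assumes the non-zero values are spread out enough that `pd.qcut` finds unique edges for them.

```python
import numpy as np
import pandas as pd


def zero_first_fractiles(ser, num_fractiles=10):
    """Label zeros as fractile 1 and non-zero values as 2..num_fractiles."""
    # Start from all-NaN so values that were already missing stay missing.
    result = pd.Series(np.nan, index=ser.index)
    is_zero = ser == 0
    is_nonzero = ser.notna() & ~is_zero
    # labels=False returns integer bin codes 0..k-1 instead of intervals.
    codes = pd.qcut(ser[is_nonzero], num_fractiles - 1, labels=False)
    result.loc[is_nonzero] = codes.to_numpy() + 2  # shift 0..8 to 2..10
    result.loc[is_zero] = 1
    return result


# Hypothetical data: 300 zeros, 300 values in 1..100, and 5 missing values.
ser = pd.Series(
    [0.0] * 300 + list(np.tile(np.arange(1.0, 101.0), 3)) + [np.nan] * 5
)
fractiles = zero_first_fractiles(ser, 10)
```

Boolean masks keep the three groups (zero, non-zero, missing) separate, so no sentinel value is needed.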

Update 1/26/14:

I came up with the following interim solution. The problem with this code, though, is that if the high-frequency value is not at an edge of the distribution, it inserts an extra bin in the middle of the existing set of bins and throws everything a little (or a lot) off.

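import bisect

import pandas as pd


def fractile_cut(ser, num_fractiles):
    # Count non-missing values (dropna() replaces the old Series.valid()).
    num_valid = ser.dropna().shape[0]
    remain_fractiles = num_fractiles
    vcounts = ser.value_counts()  # sorted by frequency, descending
    high_freq = []
    i = 0
    # Set aside every value that alone would overfill one fractile.
    while (i < len(vcounts)
           and vcounts.iloc[i] > num_valid / float(remain_fractiles)):
        curr_val = vcounts.index[i]
        high_freq.append(curr_val)
        remain_fractiles -= 1
        num_valid = num_valid - vcounts.iloc[i]
        i += 1
    # qcut only the values that are not high-frequency.
    curr_ser = ser.copy()
    curr_ser = curr_ser[~curr_ser.isin(high_freq)]
    qcut = pd.qcut(curr_ser, remain_fractiles, retbins=True)
    qcut_bins = qcut[1]
    all_bins = list(qcut_bins)
    # Insert each high-frequency value as an extra bin edge, keeping order.
    for val in high_freq:
        bisect.insort(all_bins, val)
    # include_lowest=True keeps the smallest edge inside the first bin.
    cut = pd.cut(ser, bins=all_bins, include_lowest=True)
    # .cat.codes replaces the old .labels; missing values (code -1) become 0.
    ser_fractiles = pd.Series(cut.cat.codes + 1, index=ser.index)
    return ser_fractiles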

Recommended answer

The problem is that pandas.qcut chooses the bins so that each bin/quantile holds the same number of records, but records with the same value cannot go into different bins/quantiles.
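To see why the error appears, it can help to look at the quantile edges directly. The following is a small illustration on made-up data (the Series `values` is hypothetical): with 60% zeros, several of the five requested quantile edges collapse onto the same value, and `pd.qcut` refuses to build bins from them.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 6 of the 10 values are 0.
values = pd.Series([0] * 6 + [1, 2, 3, 4])

# The edges qcut would need for 5 equal-sized bins.
edges = values.quantile(np.linspace(0, 1, 6))
print(edges.tolist())  # the first three edges are all 0

try:
    pd.qcut(values, 5)
    raised = False
except ValueError as err:
    raised = True
    print(err)  # message starts with "Bin edges must be unique"
```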

The solutions are:

1 - Use pandas >= 0.20.0, which has this fix. It added an option duplicates='raise'|'drop' to control whether to raise on duplicated edges or to drop them. Dropping them results in fewer bins than specified, and some bins larger (with more elements) than others.

2 - Use pandas.cut, which chooses the bins to be evenly spaced according to the values themselves, whereas pandas.qcut chooses the bins so that each bin holds the same number of records.

3 - Decrease the number of quantiles. Fewer quantiles means more elements per quantile.

4 - Specify a custom quantile range, e.g. [0, .50, .75, 1.], to get an unequal number of items per quantile.

5 - Rank your data with DataFrame.rank(method='first'). The ranking assigns a unique value (the rank) to each element in the dataframe while keeping the order of the elements; identical values are ranked in the order they appear in the array (see method='first'). This fixes the issue, but identical (pre-ranking) values may end up in different quantiles, which can be correct or not depending on your intent.
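Options 1, 2 and 4 from the list above can be compared on a small made-up Series. This is a sketch only: the data and the custom quantile list are hypothetical, chosen so that the first custom edge falls past the repeated zeros.

```python
import pandas as pd

# Hypothetical data: 6 of the 10 values are 0.
values = pd.Series([0] * 6 + [1, 2, 3, 4])

# Option 1: drop duplicated edges, accepting fewer (and unequal) bins.
dropped = pd.qcut(values, 5, duplicates="drop")

# Option 2: equal-width bins over the value range instead of equal counts.
equal_width = pd.cut(values, 5)

# Option 4: custom quantiles whose edges do not collide.
custom = pd.qcut(values, [0, 0.6, 0.8, 1.0])

print(dropped.value_counts())
print(equal_width.value_counts())
print(custom.value_counts())
```

With duplicates="drop" the five requested bins shrink to three here, which is the trade-off the answer describes: the bin holding the zeros is much larger than the others.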

Example of option 5:

pd.qcut(df, nbins)  # raises "ValueError: Bin edges must be unique"

Then use this instead:

pd.qcut(df.rank(method='first'), nbins)
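A self-contained version of the ranking approach, using a hypothetical Series in place of `df`, shows both the benefit and the caveat: every bin gets the same number of records, but the identical zeros are spread over several bins.

```python
import pandas as pd

# Hypothetical data: 6 of the 10 values are 0.
values = pd.Series([0] * 6 + [1, 2, 3, 4])
nbins = 5

# method="first" breaks ties by order of appearance, so ranks are unique.
ranked = values.rank(method="first")

# labels=False returns integer bin codes 0..nbins-1.
bins = pd.qcut(ranked, nbins, labels=False)

print(bins.value_counts().sort_index())  # equal-sized bins
print(bins[values == 0].unique())        # the zeros land in more than one bin
```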
