How to qcut with non-unique bin edges?

Problem description

My question is the same as this previous one:

Binning with zero values in pandas

However, I still want to include the 0 values in a fractile. Is there a way to do this? In other words, if I have 600 values, 50% of which are 0 and the rest are, let's say, between 1 and 100, how would I put all the 0 values in fractile 1 and the non-zero values in fractiles 2 to 10 (assuming I want 10 fractiles)? Could I convert the 0s to NaN, qcut the remaining non-NaN data into 9 fractiles (1 to 9), add 1 to each label (now 2 to 10), and then label all the 0 values as fractile 1 manually? Even this is tricky, because in addition to the 600 values my data set has another couple hundred that may already be NaN before I convert the 0s to NaN.
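The idea described in the question (handle the zeros separately, qcut the rest) could be sketched roughly as below. The helper name `zero_first_fractiles` and the sample data are made up for illustration, and the sketch assumes the Series has a unique index. Comparing against 0 directly, instead of converting the zeros to NaN first, keeps values that were already NaN apart from the zeros.

```python
import numpy as np
import pandas as pd


def zero_first_fractiles(ser, num_fractiles=10):
    """Hypothetical helper: zeros go to fractile 1, non-zero values to 2..num_fractiles.

    Values that are NaN in the input stay NaN in the output.
    """
    result = pd.Series(np.nan, index=ser.index)
    is_zero = ser == 0                    # NaN == 0 is False, so existing NaNs are left alone
    nonzero = ser[~is_zero].dropna()
    # labels=False returns integer bin codes 0 .. num_fractiles - 2
    codes = pd.qcut(nonzero, num_fractiles - 1, labels=False)
    result.loc[nonzero.index] = codes + 2
    result.loc[is_zero] = 1
    return result


# Illustrative data: 300 zeros, 300 values between 1 and 100, and a few NaNs
values = [0] * 300 + list(range(1, 101)) * 3 + [np.nan] * 5
ser = pd.Series(values, dtype=float)
fractiles = zero_first_fractiles(ser)
print(fractiles.value_counts(dropna=False).sort_index())
```

The qcut on the non-zero values can still fail if one of those values is itself very frequent, so this only covers the case where 0 is the single dominant value.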

Update 1/26/14:

I came up with the following interim solution. The problem with this code, though, is that if a high-frequency value is not at an edge of the distribution, it inserts an extra bin in the middle of the existing set of bins and throws everything off a little (or a lot).

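import bisect

import pandas as pd


def fractile_cut(ser, num_fractiles):
    # Count the non-NaN values
    num_valid = ser.dropna().shape[0]
    remain_fractiles = num_fractiles
    vcounts = ser.value_counts()  # sorted by descending frequency
    high_freq = []
    i = 0
    # Peel off every value that is too frequent to fit in one equal-sized fractile
    while vcounts.iloc[i] > num_valid / float(remain_fractiles):
        curr_val = vcounts.index[i]
        high_freq.append(curr_val)
        remain_fractiles -= 1
        num_valid = num_valid - vcounts.iloc[i]
        i += 1
    # qcut the remaining values into the remaining fractiles and keep the bin edges
    curr_ser = ser[~ser.isin(high_freq)]
    qcut_bins = pd.qcut(curr_ser, remain_fractiles, retbins=True)[1]
    all_bins = list(qcut_bins)
    # Insert each high-frequency value as an extra bin edge
    for val in high_freq:
        bisect.insort(all_bins, val)
    cut = pd.cut(ser, bins=all_bins)
    # Bin codes start at 0; values outside every bin have code -1 and so become 0
    ser_fractiles = pd.Series(cut.cat.codes + 1, index=ser.index)
    return ser_fractiles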

Solution

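The problem is that pandas.qcut chooses the bin edges so that each bin/quantile has the same number of records, but all records with the same value must stay in the same bin/quantile (this behaviour is in accordance with the statistical definition of a quantile). When one value is frequent enough to cover more than one quantile, two or more of the computed edges coincide and qcut raises "ValueError: Bin edges must be unique".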
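To make the failure concrete, here is a small reproduction. The data are invented to match the question's description (600 values, half of them 0, the rest between 1 and 100); only standard pandas and NumPy calls are used.

```python
import numpy as np
import pandas as pd

# Illustrative data: 300 zeros plus each of 1..100 three times
ser = pd.Series([0] * 300 + list(range(1, 101)) * 3)

# The decile edges qcut would need; the lowest ones all equal 0
edges = ser.quantile(np.linspace(0, 1, 11))
print(edges)

try:
    pd.qcut(ser, 10)
except ValueError as err:
    print("qcut failed:", err)
```

Because half of the records are 0, the lowest decile edges all collapse onto 0, which is exactly the duplicate-edge situation qcut refuses to handle by default.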

The solutions are:

1 - Use pandas >= 0.20.0, which added a duplicates='raise'|'drop' option to qcut. It controls whether to raise on duplicated edges or to drop them; dropping yields fewer bins than requested, and some bins will be larger (hold more elements) than others.

2 - Decrease the number of quantiles. Fewer quantiles means more elements per quantile, so a frequent value is less likely to span more than one edge.

3 - Rank your data with Series.rank(method='first') before binning. Ranking assigns each element a unique value (its rank) while preserving the order of the elements; identical values are ranked in the order they appear in the data (that is what method='first' does). Note that the resulting bins are defined on ranks rather than on the original values, so records with the same value can end up in different bins.

Example, where ser is a Series (pd.qcut takes one-dimensional input):

pd.qcut(ser, nbins) <-- this raises "ValueError: Bin edges must be unique"

Then use this instead:

pd.qcut(ser.rank(method='first'), nbins)

4 - Specify a custom list of quantiles, e.g. [0, .50, .75, 1.], accepting an unequal number of items per quantile.

5 - Use pandas.cut, which chooses bins that are evenly spaced over the range of the values themselves, whereas pandas.qcut chooses bins so that each holds the same number of records.
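As a rough side-by-side of options 1, 3, 4 and 5, the sketch below applies each to the same invented data (300 zeros plus values between 1 and 100). The variable names are illustrative, and the point is only to show how the bin counts and sizes differ.

```python
import pandas as pd

# Illustrative data: 300 zeros plus each of 1..100 three times
ser = pd.Series([0] * 300 + list(range(1, 101)) * 3)

# Option 1: drop duplicated edges; fewer than 10 bins come back
dropped = pd.qcut(ser, 10, duplicates="drop")

# Option 3: bin on ranks; 10 equal-sized bins, but the zeros are spread over several bins
ranked = pd.qcut(ser.rank(method="first"), 10, labels=False)

# Option 4: custom quantiles whose edges are unique for this data
custom = pd.qcut(ser, [0, 0.50, 0.75, 1.0])

# Option 5: equal-width bins over the value range, regardless of counts
equal_width = pd.cut(ser, 10)

print("duplicates='drop':", len(dropped.cat.categories), "bins")
print("rank-based bin sizes:", ranked.value_counts().sort_index().tolist())
print("custom quantile bin sizes:", custom.value_counts().sort_index().tolist())
print("equal-width bin sizes:", equal_width.value_counts().sort_index().tolist())
```

Which option fits depends on what matters more: keeping identical values together (options 1, 4 and 5) or getting equal-sized bins (option 3).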
