Why does pandas qcut raise ValueError: Bin edges must be unique?

Problem description

I have this dataset:

recency;frequency;monetary
21;156;41879955
13;88;16850284
8;74;79150488
2;74;26733719
9;55;16162365
...;...;...

The full raw data is at http://pastebin.com/beiEeS80. I loaded it into a DataFrame, and here is my complete code:

df = pd.DataFrame(datas, columns=['userid', 'recency', 'frequency', 'monetary'])
df['recency'] = df['recency'].astype(float)
df['frequency'] = df['frequency'].astype(float)
df['monetary'] = df['monetary'].astype(float)

df['recency'] = pd.qcut(df['recency'].values, 5).codes + 1
df['frequency'] = pd.qcut(df['frequency'].values, 5).codes + 1
df['monetary'] = pd.qcut(df['monetary'].values, 5).codes + 1

But it returns this error:

df['frequency'] = pd.qcut(df['frequency'].values, 5).codes + 1
ValueError: Bin edges must be unique: array([   1.,    1.,    2.,    4.,    9.,  156.])

How can I fix this?

Recommended answer

I ran this in Jupyter and placed exampledata.txt in the same directory as the notebook.

Note that the first line:

df = pd.DataFrame(datas, columns=['userid', 'recency', 'frequency', 'monetary'])

asks for the column 'userid' even though it isn't defined in the data file. I removed this column name.

import pandas as pd

def pct_rank_qcut(series, n):
    edges = pd.Series([float(i) / n for i in range(n + 1)])
    f = lambda x: (edges >= x).argmax()
    return series.rank(pct=1).apply(f)

datas = pd.read_csv('./exampledata.txt', delimiter=';')

df = pd.DataFrame(datas, columns=['recency', 'frequency', 'monetary'])

df['recency'] = df['recency'].astype(float)
df['frequency'] = df['frequency'].astype(float)
df['monetary'] = df['monetary'].astype(float)

df['recency'] = pct_rank_qcut(df.recency, 5)
df['frequency'] = pct_rank_qcut(df.frequency, 5)
df['monetary'] = pct_rank_qcut(df.monetary, 5)

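Explanation

The error comes from pd.qcut trying to build 5 bins that each hold an equal share of the rows, with edges placed at the 0%, 20%, 40%, 60%, 80% and 100% quantiles. In the data you provided, more than 28% of the 'frequency' values are 1, so the minimum and the 20% quantile are both 1 (the first two edges in your error message). Two identical edges would describe an empty bin, which is why qcut raises.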
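To make the cause concrete, here is a minimal sketch on toy data. The values are made up for illustration and are not the asker's file. It shows the duplicated quantile edges, the resulting ValueError, and, as an aside that is not part of this answer's approach, the `duplicates='drop'` option of pd.qcut, which merges identical edges at the cost of returning fewer bins.

```python
import pandas as pd

# Hypothetical toy data: 3 of 10 values are 1, so the minimum and the
# 20% quantile coincide, just like in the 'frequency' column.
freq = pd.Series([1, 1, 1, 2, 3, 4, 9, 20, 74, 156], dtype=float)

# The quantile edges that a 5-bin qcut is built on.
edges = freq.quantile([0, 0.2, 0.4, 0.6, 0.8, 1.0]).tolist()
has_duplicate_edges = len(set(edges)) < len(edges)

# With the default duplicates='raise', qcut refuses duplicated edges.
try:
    pd.qcut(freq, 5)
    raised = False
except ValueError:
    raised = True

# duplicates='drop' merges identical edges, so fewer than 5 bins come back.
binned = pd.qcut(freq, 5, duplicates='drop')
n_bins = len(binned.cat.categories)

print(edges, has_duplicate_edges, raised, n_bins)
```

Dropping duplicates is the quickest fix, but the bin numbers then no longer run from 1 to 5, which matters if the scores are meant to be comparable across columns.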

I provided a new function, pct_rank_qcut, that works around the duplicate edges by pushing all the 1's into the first bin.

    edges = pd.Series([float(i) / n for i in range(n + 1)])

This line defines a series of percentile edges based on the desired number of bins, n. For n = 5 the edges are [0.0, 0.2, 0.4, 0.6, 0.8, 1.0].

    f = lambda x: (edges >= x).argmax()

This line defines a helper function that is applied to another series in the next line. edges >= x returns a boolean series the same length as edges, where each element is True if x is less than or equal to that edge and False otherwise. For x = 0.14 the resulting (edges >= x) is [False, True, True, True, True, True]. Taking argmax() gives the first index where the series is True, in this case 1.
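The x = 0.14 case can be checked directly. This is a small sketch; `bin_of` is a hypothetical name for the helper the answer calls `f`, and it uses `.values.argmax()` so that the result is always an integer position.

```python
import pandas as pd

n = 5
edges = pd.Series([float(i) / n for i in range(n + 1)])

def bin_of(x):
    # Position of the first edge that is >= x; for a percentile rank
    # in (0, 1] this is the bin number, 1 through n.
    return (edges >= x).values.argmax()

x = 0.14
mask = edges >= x          # one boolean per edge
first_true = bin_of(x)     # position of the first True

print(mask.tolist(), first_true)
print(bin_of(0.35), bin_of(1.0))
```

argmax works here because True compares greater than False, and on ties it returns the first maximal position.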

    return series.rank(pct=1).apply(f)

This line takes the input series and converts it to percentile ranks. Those ranks can be compared against the edges created above, which is why I use apply(f). The result is a series of bin numbers from 1 to n, the same thing you were trying to get with:

pd.qcut(df['recency'].values, 5).codes + 1
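As a rough illustration of the ranking step, the sketch below runs the function on made-up toy data (not the asker's file). Tied values share the average of their ranks, so the three 1's all receive the same percentile rank and land in the same bin.

```python
import pandas as pd

def pct_rank_qcut(series, n):
    edges = pd.Series([float(i) / n for i in range(n + 1)])
    f = lambda x: (edges >= x).values.argmax()
    return series.rank(pct=True).apply(f)

# Hypothetical toy data with three tied 1's at the bottom.
freq = pd.Series([1, 1, 1, 2, 3, 4, 9, 20, 74, 156], dtype=float)

pct = freq.rank(pct=True)      # ties get the average of their ranks
bins = pct_rank_qcut(freq, 5)  # bin numbers between 1 and 5

print(pd.DataFrame({'value': freq, 'pct_rank': pct, 'bin': bins}))
print(bins.value_counts().sort_index())
```

The printed counts make the unequal bin sizes visible: the ties inflate one bin and shrink its neighbour.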

One consequence is that the bins no longer hold equal numbers of rows: bin 1 takes the rows it gains entirely from bin 2. But some choice had to be made. If you don't like this choice, use the concept to build your own ranking.

print(df.head())

   recency  frequency  monetary
0        3          5         5
1        2          5         5
2        2          5         5
3        1          5         5
4        2          5         5
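If equal-sized bins matter more than keeping tied values together, one possible "own ranking" is to rank with method='first' before calling qcut. This is a swapped-in alternative, not the method used in this answer, and the toy data is hypothetical.

```python
import pandas as pd

# Hypothetical toy data with three tied 1's at the bottom.
freq = pd.Series([1, 1, 1, 2, 3, 4, 9, 20, 74, 156], dtype=float)

# method='first' breaks ties by order of appearance, so every row gets
# a distinct rank and the quantile edges cannot coincide.
ranks = freq.rank(method='first')

# labels=False returns integer bin codes starting at 0; add 1 for 1..5.
equal_bins = pd.qcut(ranks, 5, labels=False) + 1

print(equal_bins.tolist())
print(equal_bins.value_counts().sort_index())
```

The trade-off is the opposite of pct_rank_qcut: every bin holds the same number of rows, but identical values can end up in different bins depending on their order in the data.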

Update

pd.Series.argmax() was deprecated in pandas as of this update. Simply switch to pd.Series.values.argmax() to update! Going through .values calls NumPy's argmax, which returns the integer position of the first True.

def pct_rank_qcut(series, n):
    edges = pd.Series([float(i) / n for i in range(n + 1)])
    f = lambda x: (edges >= x).values.argmax()
    return series.rank(pct=1).apply(f)
