What is the difference between pandas.qcut and pandas.cut?

Question

The documentation says:

http://pandas.pydata.org/pandas-docs/dev/basics.html

"Continuous values can be discretized using the cut (bins based on values) and qcut (bins based on sample quantiles) functions"

That sounds very abstract to me... I can see the differences in the example below, but what does qcut (sample quantile) actually do/mean? When would you use qcut versus cut?

Thanks.

import numpy as np
import pandas as pd

factors = np.random.randn(30)

In [11]:
pd.cut(factors, 5)
Out[11]:
[(-0.411, 0.575], (-0.411, 0.575], (-0.411, 0.575], (-0.411, 0.575], (0.575, 1.561], ..., (-0.411, 0.575], (-1.397, -0.411], (0.575, 1.561], (-2.388, -1.397], (-0.411, 0.575]]
Length: 30
Categories (5, object): [(-2.388, -1.397] < (-1.397, -0.411] < (-0.411, 0.575] < (0.575, 1.561] < (1.561, 2.547]]

In [14]:
pd.qcut(factors, 5)
Out[14]:
[(-0.348, 0.0899], (-0.348, 0.0899], (0.0899, 1.19], (0.0899, 1.19], (0.0899, 1.19], ..., (0.0899, 1.19], (-1.137, -0.348], (1.19, 2.547], [-2.383, -1.137], (-0.348, 0.0899]]
Length: 30
Categories (5, object): [[-2.383, -1.137] < (-1.137, -0.348] < (-0.348, 0.0899] < (0.0899, 1.19] < (1.19, 2.547]]

Recommended answer

To begin, note that "quantile" is just the most general term for things like percentiles, quartiles, and medians. You specified five bins in your example, so you are asking qcut for quintiles.
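To make "sample quantiles" concrete, here is a minimal sketch that compares the bin edges qcut computes with the quantiles NumPy reports for the same data. The seeded generator and the variable names are assumptions added for reproducibility; they are not part of the original question.

```python
import numpy as np
import pandas as pd

# Assumption: a fixed seed so the draw is reproducible (the question used np.random.randn)
rng = np.random.default_rng(0)
factors = rng.standard_normal(30)

# retbins=True makes qcut also return the bin edges it computed
_, edges = pd.qcut(factors, 5, retbins=True)

# Asking for 5 bins means cutting at the 0%, 20%, 40%, 60%, 80% and 100% quantiles
expected = np.quantile(factors, [0, 0.2, 0.4, 0.6, 0.8, 1.0])

print(edges)
print(np.allclose(edges, expected))
```

The first and last edges are simply the minimum and maximum of the sample, and the four interior edges are the quintile breakpoints.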

So, when you ask for quintiles with qcut, the bins are chosen so that each bin holds the same number of records. You have 30 records, so you should have 6 in each bin (your output should look like this, although the breakpoints will differ due to the random draw):

pd.qcut(factors, 5).value_counts()

[-2.578, -0.829]    6
(-0.829, -0.36]     6
(-0.36, 0.366]      6
(0.366, 0.868]      6
(0.868, 2.617]      6
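The equal-count behaviour can be checked with a short, reproducible sketch. It assumes a seeded generator and distinct sample values (which is what a continuous random draw gives in practice); the seed and names are illustrative only.

```python
import numpy as np
import pandas as pd

# Assumption: fixed seed for reproducibility; any seed gives the same counts
rng = np.random.default_rng(0)
factors = rng.standard_normal(30)

# Wrap the Categorical in a Series and count records per quintile bin
qcut_counts = pd.Series(pd.qcut(factors, 5)).value_counts().sort_index()
print(qcut_counts)
```

The interval labels depend on the draw, but with 30 distinct values and 5 bins every bin should hold 6 records.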

Conversely, for cut you will see something more uneven:

pd.cut(factors, 5).value_counts()

(-2.583, -1.539]    5
(-1.539, -0.5]      5
(-0.5, 0.539]       9
(0.539, 1.578]      9
(1.578, 2.617]      2

That's because cut chooses bins that are evenly spaced according to the values themselves, not the frequency of those values. Hence, because you drew from a random normal, you'll see higher frequencies in the inner bins and lower frequencies in the outer ones. This is essentially a tabular form of a histogram (which you would expect to be fairly bell-shaped with 30 records).
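The histogram analogy can be illustrated with a small sketch that compares the per-bin counts from cut with those from np.histogram on the same data. The seed and variable names are assumptions for illustration. One detail worth noting: cut widens the lowest edge slightly so that the minimum value falls inside the first bin, so only the remaining bins have exactly equal width.

```python
import numpy as np
import pandas as pd

# Assumption: fixed seed for reproducibility
rng = np.random.default_rng(0)
factors = rng.standard_normal(30)

# cut splits the range [min, max] into 5 equal-width bins
binned, edges = pd.cut(factors, 5, retbins=True)
cut_counts = pd.Series(binned).value_counts().sort_index()

# np.histogram uses the same equal-width binning over [min, max]
hist_counts, hist_edges = np.histogram(factors, bins=5)

print(cut_counts)
print(hist_counts)

# Bin widths after the first (slightly widened) bin
widths = np.diff(edges)[1:]
print(widths)
```

The two sets of counts should agree for continuous data like this, where no value lands exactly on an interior edge (cut closes bins on the right, np.histogram on the left).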
