Binning a dataframe in pandas in Python


Question

Given the following dataframe in pandas:

import numpy as np
import pandas

df = pandas.DataFrame({"a": np.random.random(100), "b": np.random.random(100), "id": np.arange(100)})

Here id identifies each point, which consists of an a value and a b value. How can I bin a and b into a specified set of bins, so that I can then take the median or mean of a and b in each bin? Any given row of df might have NaN for a, for b, or for both. Thanks.

Here's a better example using Joe Kington's solution with a more realistic df. What I'm unsure about is how to access the df.b elements for each df.a group below:

a = np.random.random(20)
df = pandas.DataFrame({"a": a, "b": a + 10})
# bins for df.a
bins = np.linspace(0, 1, 10)
# bin df according to a
groups = df.groupby(np.digitize(df.a,bins))
# Get the mean of a in each group
print(groups.mean())
## But how to get the mean of b for each group of a?
# ...

Recommended answer

There may be a more efficient way (I have a feeling pandas.crosstab would be useful here), but here's how I'd do it:

import numpy as np
import pandas

df = pandas.DataFrame({"a": np.random.random(100),
                       "b": np.random.random(100),
                       "id": np.arange(100)})

# Bin the data frame by "a" using 10 evenly spaced bin edges...
bins = np.linspace(df.a.min(), df.a.max(), 10)
groups = df.groupby(np.digitize(df.a, bins))

# Get the mean of each bin:
print(groups.mean())  # Also could do "groups.aggregate(np.mean)"

# Similarly, the median:
print(groups.median())

# Apply some arbitrary function to aggregate binned data
print(groups.aggregate(lambda x: np.mean(x[x > 0.5])))
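
The question mentions that a or b may be NaN, which the snippet above does not address. As far as I can tell, np.digitize places NaN past the last bin edge (index len(bins)), so those rows would be silently grouped with the largest values. A minimal sketch of one way around that, assuming it is acceptable to drop rows where a is missing before binning (the seed, the NaN positions, and the names valid, bin_index and means are only for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"a": rng.random(100), "b": rng.random(100)})

# Inject a few missing values for illustration.
df.loc[[3, 7], "a"] = np.nan
df.loc[[5, 7], "b"] = np.nan

bins = np.linspace(0, 1, 11)  # 11 edges -> 10 bins over [0, 1]

# Drop rows whose binning key is missing; NaN in "b" can stay,
# because mean() skips NaN by default.
valid = df.dropna(subset=["a"])
bin_index = np.digitize(valid["a"], bins)

means = valid.groupby(bin_index)["b"].mean()
print(means)
```

Rows with a missing b still contribute to the count of their bin but not to the mean; if that is not wanted, drop them as well with dropna(subset=["a", "b"]).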


As the OP was asking specifically for just the means of b binned by the values in a, just do

groups.mean().b

Also, if you want the index to look nicer (e.g. display the intervals as the index), as in @bdiamante's example, use pandas.cut instead of numpy.digitize. (Kudos to bdiamante; I didn't realize pandas.cut existed.)

import numpy as np
import pandas

df = pandas.DataFrame({"a": np.random.random(100), 
                       "b": np.random.random(100) + 10})

# Bin the data frame by "a" using 10 evenly spaced bin edges (9 intervals)...
bins = np.linspace(df.a.min(), df.a.max(), 10)
groups = df.groupby(pandas.cut(df.a, bins))

# Get the mean of b, binned by the values in a
print(groups.mean().b)

The result is:

a
(0.00186, 0.111]    10.421839
(0.111, 0.22]       10.427540
(0.22, 0.33]        10.538932
(0.33, 0.439]       10.445085
(0.439, 0.548]      10.313612
(0.548, 0.658]      10.319387
(0.658, 0.767]      10.367444
(0.767, 0.876]      10.469655
(0.876, 0.986]      10.571008
Name: b
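
One thing to watch for with pandas.cut: its intervals are closed on the right by default, so when the first edge is exactly df.a.min(), the row holding the minimum falls outside every bin and is dropped from the groups. The sketch below is one possible variant, not part of the original answer, that passes include_lowest=True and computes mean, median and count of b in one call (the seed and the names binned and summary are only for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"a": rng.random(100), "b": rng.random(100) + 10})

bins = np.linspace(df["a"].min(), df["a"].max(), 10)  # 10 edges -> 9 intervals

# include_lowest=True makes the first interval include its left edge,
# so the row with the smallest "a" is not lost.
binned = pd.cut(df["a"], bins, include_lowest=True)

# observed=False keeps every interval in the result, even an empty one.
summary = df.groupby(binned, observed=False)["b"].agg(["mean", "median", "count"])
print(summary)
```

Rows where a is NaN get a NaN bin label from pandas.cut, and groupby leaves them out by default, so no separate dropna step is needed in this variant.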
