Binning a dataframe in pandas in Python
Question
Given the following dataframe in pandas:
import numpy as np
import pandas
df = pandas.DataFrame({"a": np.random.random(100), "b": np.random.random(100), "id": np.arange(100)})
where id is an id for each point consisting of an a and b value, how can I bin a and b into a specified set of bins (so that I can then take the median/average value of a and b in each bin)? df might have NaN values for a or b (or both) for any given row in df. Thanks.
Here's a better example using Joe Kington's solution with a more realistic df. The thing I'm unsure about is how to access the df.b elements for each df.a group below:
a = np.random.random(20)
df = pandas.DataFrame({"a": a, "b": a + 10})
# bins for df.a
bins = np.linspace(0, 1, 10)
# bin df according to a
groups = df.groupby(np.digitize(df.a,bins))
# Get the mean of a in each group
print(groups.mean())
## But how to get the mean of b for each group of a?
# ...
Recommended answer
There may be a more efficient way (I have a feeling pandas.crosstab would be useful here), but here's how I'd do it:
import numpy as np
import pandas
df = pandas.DataFrame({"a": np.random.random(100),
                       "b": np.random.random(100),
                       "id": np.arange(100)})
# Bin the data frame by "a" using 10 evenly spaced bin edges...
bins = np.linspace(df.a.min(), df.a.max(), 10)
groups = df.groupby(np.digitize(df.a, bins))
# Get the mean of each bin:
print(groups.mean())  # Also could do "groups.aggregate(np.mean)"
# Similarly, the median:
print(groups.median())
# Apply some arbitrary function to aggregate binned data
print(groups.aggregate(lambda x: np.mean(x[x > 0.5])))
As the OP was asking specifically for just the means of b binned by the values in a, just do
groups.mean().b
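As a small illustration of the same idea, column b can also be selected on the grouped object before aggregating, so that only b is reduced. The sketch below uses a tiny hand-made frame and explicit bin edges; both are assumptions chosen for readability and are not part of the original data.

```python
import numpy as np
import pandas as pd

# Small deterministic frame (illustrative values, not the random data above)
df = pd.DataFrame({"a": [0.05, 0.15, 0.25, 0.35, 0.45, 0.55],
                   "b": [10.0, 11.0, 12.0, 13.0, 14.0, 15.0]})

# Explicit bin edges: three bins covering [0, 0.6)
bins = np.array([0.0, 0.2, 0.4, 0.6])

# np.digitize returns a 1-based bin index for each value of a
labels = np.digitize(df["a"], bins)
groups = df.groupby(labels)

# Select "b" first, then aggregate only that column
b_mean = groups["b"].mean()
b_stats = groups["b"].agg(["mean", "median"])

print(b_mean)
print(b_stats)
```

Selecting the column first avoids computing means for columns that are not needed, and agg with a list of names returns several statistics per bin at once.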
Also, if you want the index to look nicer (e.g. display intervals as the index), as in @bdiamante's example, use pandas.cut instead of numpy.digitize. (Kudos to bdiamante. I didn't realize pandas.cut existed.)
import numpy as np
import pandas
df = pandas.DataFrame({"a": np.random.random(100),
                       "b": np.random.random(100) + 10})
# Bin the data frame by "a" using 10 bin edges (9 intervals)...
bins = np.linspace(df.a.min(), df.a.max(), 10)
groups = df.groupby(pandas.cut(df.a, bins))
# Get the mean of b, binned by the values in a
print(groups.mean().b)
The result is:
a
(0.00186, 0.111] 10.421839
(0.111, 0.22] 10.427540
(0.22, 0.33] 10.538932
(0.33, 0.439] 10.445085
(0.439, 0.548] 10.313612
(0.548, 0.658] 10.319387
(0.658, 0.767] 10.367444
(0.767, 0.876] 10.469655
(0.876, 0.986] 10.571008
Name: b
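Since the question mentions NaN values, here is a minimal sketch of how pandas.cut and groupby treat them. The frame and bin edges are made up for illustration. Two behaviours are worth noting: rows whose a is NaN get no bin and are left out of the groups, and because intervals are closed on the right, a value equal to the lowest edge is also left unbinned unless include_lowest=True is passed.

```python
import numpy as np
import pandas as pd

# Illustrative data with NaN in both columns (assumed values)
df = pd.DataFrame({"a": [0.0, 0.2, 0.4, np.nan, 0.6, 1.0],
                   "b": [10.0, 12.0, np.nan, 13.0, 14.0, 16.0]})

bins = [0.0, 0.5, 1.0]

# Without include_lowest, a == 0.0 falls outside the first (0.0, 0.5] interval
n_unbinned_default = pd.cut(df["a"], bins).isna().sum()

# include_lowest=True keeps the value at the lowest edge in the first bin
binned = pd.cut(df["a"], bins, include_lowest=True)
n_unbinned = binned.isna().sum()

# Rows with an unbinned a are dropped by groupby; NaN in b is skipped by mean()
b_mean = df.groupby(binned, observed=False)["b"].mean()

print(n_unbinned_default, n_unbinned)
print(b_mean)
```

In this sketch the row with a missing a contributes to no bin, and the missing b in the first bin is simply ignored when averaging.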