How does the pandas groupby method actually work?
Question
I was trying to understand the pandas.DataFrame.groupby() function and came across this example in the documentation:
In [1]: df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
   ...:                           'foo', 'bar', 'foo', 'foo'],
   ...:                    'B' : ['one', 'one', 'two', 'three',
   ...:                           'two', 'two', 'one', 'three'],
   ...:                    'C' : np.random.randn(8),
   ...:                    'D' : np.random.randn(8)})
   ...:
In [2]: df
Out[2]:
     A      B         C         D
0  foo    one  0.469112 -0.861849
1  bar    one -0.282863 -2.104569
2  foo    two -1.509059 -0.494929
3  bar  three -1.135632  1.071804
4  foo    two  1.212112  0.721555
5  bar    two -0.173215 -0.706771
6  foo    one  0.119209 -1.039575
7  foo  three -1.044236  0.271860
Now, to explore further, I did this:
print(df.groupby('B').head())
It outputs the same DataFrame, but when I do this:
print(df.groupby('B'))
it gives me this:
<pandas.core.groupby.DataFrameGroupBy object at 0x7f65a585b390>
What does this mean? On a normal DataFrame, printing .head() simply outputs the first 5 rows. What is happening here?
Also, why does printing .head() give the same output as the DataFrame? Shouldn't it be grouped by the elements of column 'B'?
Recommended answer
Suppose you only call:
df.groupby('A')
You get a GroupBy object. You haven't applied any function to it at that point. Under the hood, while this definition might not be perfect, you can think of a groupby object as:
- An iterator of (group, DataFrame) pairs, for DataFrames, or
- An iterator of (group, Series) pairs, for Series.
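For the Series case, here is a minimal sketch. It assumes pandas is imported as pd; the `toy` frame and the name `grouped_b` are made up for illustration. Selecting one column from the grouped frame and iterating over it should yield (group, Series) pairs:

```python
import pandas as pd

# Hypothetical toy data, not taken from the question.
toy = pd.DataFrame({'A': [1, 1, 2, 2], 'B': [1, 2, 3, 4]})

# Selecting a single column from the GroupBy gives a grouped Series.
grouped_b = toy.groupby('A')['B']

# Iterating yields (group key, Series) pairs.
pairs = [(key, values) for key, values in grouped_b]
for key, values in pairs:
    print(key, type(values).__name__, values.tolist())
```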
To illustrate:
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 2], 'B': [1, 2, 3, 4]})
grouped = df.groupby('A')

# each `i` is a tuple of (group, DataFrame),
# so the output here will be a little messy
for i in grouped:
    print(i)
(1, A B
0 1 1
1 1 2)
(2, A B
2 2 3
3 2 4)
# this version unpacks each tuple in the loop header:
# each `group` is the group key, each `group_df` is its
# corresponding DataFrame (named `group_df` so that the
# original `df` is not overwritten)
for group, group_df in grouped:
    print('group of A:', group, '\n')
    print(group_df, '\n')
group of A: 1
A B
0 1 1
1 1 2
group of A: 2
A B
2 2 3
3 2 4
# and if you just want to see the group keys,
# the second loop variable is a "throwaway"
for group, _ in grouped:
    print('group of A:', group, '\n')
group of A: 1
group of A: 2
Now, as for .head: just have a look at the docs for that method:
Essentially equivalent to
.apply(lambda x: x.head(n))
So here you're actually applying a function to each group of the groupby object. Keep in mind that .head() defaults to n=5 and is applied to each group (each DataFrame), so because every group in your data has 5 rows or fewer, you get your original DataFrame back.
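As a rough check of that reasoning, the sketch below rebuilds a frame shaped like the one in the question (the random values will differ, and the names `question_df` and `by_b` are introduced here only for illustration) and looks at the group sizes for column 'B':

```python
import numpy as np
import pandas as pd

# Same shape as the DataFrame in the question; the random values differ.
question_df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
    'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
    'C': np.random.randn(8),
    'D': np.random.randn(8),
})

by_b = question_df.groupby('B')

# Number of rows in each group: none exceeds 5.
group_sizes = by_b.size()
print(group_sizes)

# head() defaults to n=5, so every row of every group is kept.
same_as_original = by_b.head().equals(question_df)
print(same_as_original)

# With a smaller n, rows start to drop out.
first_two = by_b.head(2)
print(len(first_two))
```

Since the groups 'one', 'two' and 'three' hold 3, 3 and 2 rows, nothing is cut off until n drops below 3.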
Consider this with the small A/B DataFrame from the illustration above. If you use .head(1), you get only the first row of each group:
print(df.groupby('A').head(1))
A B
0 1 1
2 2 3
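One way to convince yourself of this is to do the per-group selection by hand. The sketch below is only an illustration, not how pandas implements it: it iterates over the groups, takes the first row of each, and concatenates the pieces (the names `toy`, `manual` and `builtin` are made up here):

```python
import pandas as pd

toy = pd.DataFrame({'A': [1, 1, 2, 2], 'B': [1, 2, 3, 4]})
grouped = toy.groupby('A')

# Take the first row of each group by hand, then glue the pieces together
# and restore the original row order.
manual = pd.concat([group_df.head(1) for _, group_df in grouped]).sort_index()

# The built-in GroupBy.head makes the same selection in one call.
builtin = grouped.head(1)

print(manual.equals(builtin))
```

The result is not indexed by group key: the rows keep their original index labels, which is why the output looks like a plain slice of the original DataFrame.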