Keep other columns when doing groupby
Problem description
I'm using groupby on a pandas DataFrame to drop all rows that don't have the minimum of a specific column. Something like this:
df1 = df.groupby("item", as_index=False)["diff"].min()
However, if I have more than those two columns, the other columns (e.g. otherstuff in my example) get dropped. Can I keep those columns using groupby, or am I going to have to find a different way to drop the rows?
My data looks like:
   item  diff  otherstuff
0     1     2           1
1     1     1           2
2     1     3           7
3     2    -1           0
4     2     1           3
5     2     4           9
6     2    -6           2
7     3     0           0
8     3     2           9
and should end up like:
   item  diff  otherstuff
0     1     1           2
1     2    -6           2
2     3     0           0
but what I'm getting is:
   item  diff
0     1     1
1     2    -6
2     3     0
I've been looking through the documentation and can't find anything. I tried:
df1 = df.groupby(["item", "otherstuff"], as_index=False)["diff"].min()
df1 = df.groupby("item", as_index=False)["diff"].min()["otherstuff"]
df1 = df.groupby("item", as_index=False)["otherstuff", "diff"].min()
But none of those work (I realized with the last one that the syntax is meant for aggregating after a group is created).
Method #1: use idxmin() to get the indices of the elements of minimum diff, and then select those:
>>> df.loc[df.groupby("item")["diff"].idxmin()]
   item  diff  otherstuff
1     1     1           2
6     2    -6           2
7     3     0           0
[3 rows x 3 columns]
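For anyone who wants to run Method #1 end to end, here is a minimal sketch that rebuilds the sample data from the question and splits the one-liner into its two steps. The variable names (min_labels, result_idxmin, result_idxmin_renumbered) are illustrative only, and the final reset_index call is an optional extra that is not part of the original answer.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "item": [1, 1, 1, 2, 2, 2, 2, 3, 3],
    "diff": [2, 1, 3, -1, 1, 4, -6, 0, 2],
    "otherstuff": [1, 2, 7, 0, 3, 9, 2, 0, 9],
})

# Step 1: for each item, find the index label of the row with the smallest diff.
min_labels = df.groupby("item")["diff"].idxmin()

# Step 2: select those rows by label; all columns are kept,
# and the rows keep their original index labels.
result_idxmin = df.loc[min_labels]

# Optional (not in the original answer): renumber the rows from 0.
result_idxmin_renumbered = result_idxmin.reset_index(drop=True)

print(result_idxmin)
```

Because the selection is by label, this relies on the DataFrame's index labels being unique; with duplicated labels, df.loc could return more rows than intended.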
Method #2: sort by diff, and then take the first element in each item group:
>>> df.sort_values("diff").groupby("item", as_index=False).first()
   item  diff  otherstuff
0     1     1           2
1     2    -6           2
2     3     0           0
[3 rows x 3 columns]
Note that the resulting indices are different even though the row content is the same.
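To make the index difference concrete, the following sketch runs both methods on the question's sample data and compares them. The names by_idxmin, by_sort and same_rows are made up for illustration, and the comparison step is an addition rather than part of the original answer.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "item": [1, 1, 1, 2, 2, 2, 2, 3, 3],
    "diff": [2, 1, 3, -1, 1, 4, -6, 0, 2],
    "otherstuff": [1, 2, 7, 0, 3, 9, 2, 0, 9],
})

# Method #1: rows keep their original index labels.
by_idxmin = df.loc[df.groupby("item")["diff"].idxmin()]

# Method #2: as_index=False produces a fresh index starting at 0.
by_sort = df.sort_values("diff").groupby("item", as_index=False).first()

print(by_idxmin.index.tolist())
print(by_sort.index.tolist())

# Compare the cell values only, ignoring the index labels.
same_rows = (
    by_idxmin.reset_index(drop=True).to_numpy().tolist()
    == by_sort.to_numpy().tolist()
)
print(same_rows)
```

One thing to keep in mind when choosing between the two: first() returns the first non-null value in each column, so if the data contained missing values, Method #2 could combine values from different rows within a group, whereas Method #1 always returns whole rows.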