Pandas: Find most common string per person

Problem description

I would like to find the most common string value in animal when aggregating data by id. If the counts are tied, I would like to pick the last value of animal.

   id   animal       date
0   1    dog      2018-01-01
1   1    dog      2018-01-02
2   1    cat      2018-01-03
3   2    cat      2018-01-01
4   3    dog      2018-01-01
5   4   fish      2018-01-01
6   5    dog      2018-01-01
7   5    cat      2018-01-02

The output should look like:

   id animal
0  1   dog
1  2   cat
2  3   dog
3  4   fish
4  5   cat

I haven't been able to get this to work properly. I tried using pd.get_dummies and counting, but had no luck. Ideally, the solution will use built-in, vectorised pandas/numpy operations (filtering, joins, np.where, etc.), as groupby.apply is very slow and the data is somewhat sizable.

Recommended answer

Group by the id and animal columns, and for each pair get the count and the last date on which it appeared.

Then sort the resulting data frame by id, count and last, and drop duplicate values of id, keeping the last row. Because of the sort order, that row holds the most common animal for each id, and if two animals are tied on count, the one that was observed last in the table. Finally, drop the extra columns count and last.

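columns = ['id', 'animal']

# count rows per (id, animal) pair and record the last date each pair appeared
df2 = df.groupby(columns).date.agg(['count', 'last']).reset_index()
# within each id, the most common (then most recently seen) animal sorts last
df3 = df2.sort_values(['id', 'count', 'last'])
# keep that final row per id and drop the helper columns
df3.drop_duplicates('id', keep='last')[columns]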

# outputs:

   id animal
1   1    dog
2   2    cat
3   3    dog
4   4   fish
5   5    cat
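
For anyone who wants to run the answer end to end, here is a minimal self-contained sketch. It rebuilds the sample table from the question and wraps the same three steps in a helper. The function name most_common_animal is hypothetical and not part of the original answer, and the sketch assumes the rows of each id are already in chronological order, since the 'last' aggregation takes the last row of each group rather than the maximum date.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "id": [1, 1, 1, 2, 3, 4, 5, 5],
    "animal": ["dog", "dog", "cat", "cat", "dog", "fish", "dog", "cat"],
    "date": pd.to_datetime([
        "2018-01-01", "2018-01-02", "2018-01-03", "2018-01-01",
        "2018-01-01", "2018-01-01", "2018-01-01", "2018-01-02",
    ]),
})


def most_common_animal(frame):
    """Hypothetical helper: most frequent animal per id, ties go to the latest one.

    Assumes rows are already in chronological order within each id.
    """
    columns = ["id", "animal"]
    # One row per (id, animal) pair with its row count and last date seen.
    stats = frame.groupby(columns)["date"].agg(["count", "last"]).reset_index()
    # Ascending sort puts the highest count (then the latest date) last per id.
    ranked = stats.sort_values(["id", "count", "last"])
    # Keep only that last row for each id and drop the helper columns.
    return ranked.drop_duplicates("id", keep="last")[columns].reset_index(drop=True)


result = most_common_animal(df)
print(result)
```

If the rows are not guaranteed to be in date order, replacing 'last' with 'max' in the aggregation (and in the sort keys) is one way to break ties by the latest date instead of by row position.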
