Fast pandas filtering


Problem description

I want to filter a pandas DataFrame, keeping the rows whose name column entry appears in a given list.

Here is a DataFrame:

from pandas import DataFrame

x = DataFrame(
    [['sam', 328], ['ruby', 3213], ['jon', 121]],
    columns=['name', 'score'])

Now let's say we have a list, ['sam', 'ruby'], and we want to find all rows where the name is in the list, then sum the score.

My solution is as follows:

total = 0
names = ['sam', 'ruby']
for name in names:
    identified = x[x['name'] == name]
    total = total + sum(identified['score'])

However, when the DataFrame gets extremely large and the list of names gets very large too, this becomes very slow.

Is there a faster alternative?

Thanks

Recommended answer

Try using isin (thanks to DSM for suggesting loc over ix here; ix has since been removed from pandas):

In [78]: x = pd.DataFrame([['sam',328],['ruby',3213],['jon',121]], columns = ['name', 'score'])

In [79]: names = ['sam', 'ruby']

In [80]: x['name'].isin(names)
Out[80]: 
0     True
1     True
2    False
Name: name, dtype: bool

In [81]: x.loc[x['name'].isin(names), 'score'].sum()
Out[81]: 3541
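
For reference, here is the same idea as a standalone script. This is a minimal sketch: the helper name `sum_scores_for_names` is hypothetical (not part of pandas), and it assumes the frame has `name` and `score` columns as in the question. The loop from the question is included only to check that both approaches agree.

```python
import pandas as pd


def sum_scores_for_names(df, names):
    """Sum 'score' over the rows whose 'name' is in `names`.

    Hypothetical helper for illustration; assumes columns 'name' and 'score'.
    """
    mask = df["name"].isin(names)  # boolean Series, one entry per row
    return df.loc[mask, "score"].sum()


x = pd.DataFrame(
    [["sam", 328], ["ruby", 3213], ["jon", 121]],
    columns=["name", "score"],
)
names = ["sam", "ruby"]

# Approach from the question: one full comparison over the frame per name.
loop_total = 0
for name in names:
    loop_total += x.loc[x["name"] == name, "score"].sum()

# isin builds a single boolean mask for all names at once.
vectorized_total = sum_scores_for_names(x, names)
print(loop_total, vectorized_total)
```

The loop compares the whole name column once per entry in the list, while isin produces one mask in a single call, which is why it tends to scale better as the list grows.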


CT Zhu suggests np.in1d as an alternative, which was slightly faster in this timing on 3,000 rows:

In [105]: y = pd.concat([x]*1000)
In [109]: %timeit y.loc[y['name'].isin(names), 'score'].sum()
1000 loops, best of 3: 413 µs per loop

In [110]: %timeit y.loc[np.in1d(y['name'], names), 'score'].sum()
1000 loops, best of 3: 335 µs per loop
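
Timings like these depend on the data size, dtype, and library versions, so it is worth re-measuring on your own data. The sketch below does that outside IPython with the standard timeit module. It makes two assumptions that are not in the original answer: it uses np.isin, since recent NumPy releases deprecate np.in1d in favor of it, and the function names `with_pandas_isin` and `with_numpy_isin` are made up for illustration.

```python
import timeit

import numpy as np
import pandas as pd

x = pd.DataFrame(
    [["sam", 328], ["ruby", 3213], ["jon", 121]],
    columns=["name", "score"],
)
# 3000 rows, as in the answer; ignore_index gives a clean 0..2999 index.
y = pd.concat([x] * 1000, ignore_index=True)
names = ["sam", "ruby"]


def with_pandas_isin():
    return y.loc[y["name"].isin(names), "score"].sum()


def with_numpy_isin():
    # np.isin works on a plain NumPy array and returns a boolean array.
    mask = np.isin(y["name"].to_numpy(), names)
    return y.loc[mask, "score"].sum()


# Both variants should select the same rows and give the same total.
results = {f.__name__: int(f()) for f in (with_pandas_isin, with_numpy_isin)}

for func in (with_pandas_isin, with_numpy_isin):
    seconds = timeit.timeit(func, number=200) / 200
    print(f"{func.__name__}: {seconds * 1e6:.0f} us per call")
```

No expected timings are given here because which variant wins can differ between machines and versions; only the totals are guaranteed to match.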
