Fast pandas filtering

Question
I want to filter a pandas DataFrame, keeping rows whose name column value appears in a given list.

Here we have a DataFrame:
import pandas as pd

x = pd.DataFrame(
    [['sam', 328], ['ruby', 3213], ['jon', 121]],
    columns=['name', 'score'])
Now let's say we have a list, ['sam', 'ruby'], and we want to find all rows where the name is in the list, then sum their scores.
My approach is as follows:
total = 0
names = ['sam', 'ruby']
for name in names:
    identified = x[x['name'] == name]
    total = total + sum(identified['score'])
However, when the DataFrame gets extremely large, and the list of names gets very large too, everything becomes very slow.
Is there a faster option?

Thanks
Answer
Try using isin (thanks to DSM for suggesting loc over ix here):
In [78]: x = pd.DataFrame([['sam',328],['ruby',3213],['jon',121]], columns = ['name', 'score'])
In [79]: names = ['sam', 'ruby']
In [80]: x['name'].isin(names)
Out[80]:
0     True
1     True
2    False
Name: name, dtype: bool
In [81]: x.loc[x['name'].isin(names), 'score'].sum()
Out[81]: 3541
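As a side note (not part of the original answer), the same boolean mask from isin can be inverted with ~ to select the rows whose name is not in the list; a minimal sketch using the question's data:

```python
import pandas as pd

x = pd.DataFrame([['sam', 328], ['ruby', 3213], ['jon', 121]],
                 columns=['name', 'score'])
names = ['sam', 'ruby']

# ~ negates the boolean mask, keeping rows whose name is NOT in the list
excluded = x.loc[~x['name'].isin(names), 'score'].sum()
print(excluded)  # 121 (only 'jon' remains)
```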
CT Zhu suggests a faster alternative using np.in1d:
In [105]: y = pd.concat([x]*1000)
In [109]: %timeit y.loc[y['name'].isin(names), 'score'].sum()
1000 loops, best of 3: 413 µs per loop
In [110]: %timeit y.loc[np.in1d(y['name'], names), 'score'].sum()
1000 loops, best of 3: 335 µs per loop
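A note beyond the original answer: in current NumPy, np.isin is the recommended successor to np.in1d (which has since been deprecated). A sketch of the same sum using it:

```python
import numpy as np
import pandas as pd

x = pd.DataFrame([['sam', 328], ['ruby', 3213], ['jon', 121]],
                 columns=['name', 'score'])
names = ['sam', 'ruby']

# np.isin is the modern replacement for np.in1d
total = x.loc[np.isin(x['name'], names), 'score'].sum()
print(total)  # 3541
```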