A faster alternative to Pandas `isin` function


Question

I have a very large data frame df that looks like this:

ID       Value1    Value2
1345      3.2      332
1355      2.2      32
2346      1.0      11
3456      8.9      322

I also have a list, ID_list, that contains a subset of the IDs. I need the subset of df whose ID is contained in ID_list.

Currently, I am using df_sub=df[df.ID.isin(ID_list)] to do it, but it takes a lot of time. The IDs contained in ID_list don't follow any pattern, so they don't fall within a certain range. I also need to apply the same operation to many similar dataframes. Is there any faster way to do this? Would it help a lot to make ID the index?

Thanks!

Recommended answer

EDIT 2: Here's a link to a more recent look into the performance of various pandas operations, though to date it doesn't seem to include merge and join.

https://github.com/mm-mansour/Fast-Pandas

EDIT 1: These benchmarks were run on a quite old version of pandas and are likely no longer relevant. See Mike's comment below on merge.

It depends on the size of your data, but for large datasets DataFrame.join seems to be the way to go. This requires your DataFrame index to be your 'ID' and the Series or DataFrame you're joining against to have an index that is your 'ID_list'. The Series must also have a name to be used with join; it gets pulled in as a new column with that name. You also need to specify an inner join to get something like isin, because join defaults to a left join. The query 'in' syntax seems to have the same speed characteristics as isin for large datasets.
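As a minimal sketch of the join approach just described, applied to the sample data from the question: the contents of ID_list, the placeholder column name in_list, and the variable names indexed and lookup are assumptions made for illustration, not part of the original answer.

```python
import pandas as pd

# Sample frame from the question.
df = pd.DataFrame({
    "ID": [1345, 1355, 2346, 3456],
    "Value1": [3.2, 2.2, 1.0, 8.9],
    "Value2": [332, 32, 11, 322],
})
ID_list = [1355, 3456, 9999]  # hypothetical IDs; 9999 is not in df

# join matches on the index, so the IDs must become the index.
indexed = df.set_index("ID")

# A named Series whose index is ID_list; its values are only placeholders.
lookup = pd.Series(True, index=pd.Index(ID_list, name="ID"), name="in_list")

# The inner join keeps only IDs present on both sides; drop the placeholder.
df_sub = indexed.join(lookup, how="inner").drop(columns="in_list")

# Reference result computed with isin for comparison.
baseline = df[df.ID.isin(ID_list)].set_index("ID")
print(df_sub)
```

One difference from isin to keep in mind: if ID_list contains duplicates, the join repeats the matching rows, so deduplicating the list first may be necessary.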

If you're working with small datasets, the behavior is different: a list comprehension or an apply against a dictionary actually becomes faster than isin.
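A minimal sketch of the small-data variant follows. It uses a set for the membership test where the answer uses a dictionary (both give hashed lookups); the sample values and the name id_set are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({"ID": [1, 2, 3, 4]})

# Hypothetical lookup values; a set gives constant-time membership tests.
id_set = set(range(2, 10000))

# Build a plain Python list of booleans and use it as a row mask.
mask = [x in id_set for x in df.ID]
df_sub = df[mask]
print(df_sub)
```

Whether this beats isin depends on the frame size and the pandas version, so it is worth timing on your own data before committing to it.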

Otherwise, you can try to get more speed with Cython.

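import pandas as pd

# I'm ignoring that the index is defaulting to a sequential number. You
# would need to explicitly assign your IDs to the index here, e.g.:
# >>> l_series.index = ID_list
mil = list(range(1000000))  # materialized as a list (range is lazy in Python 3)
l = mil
l_series = pd.Series(l)

df = pd.DataFrame(l_series, columns=['ID'])

# DataFrame.join needs a named Series. Set the name only after building df,
# so that it does not conflict with the 'ID' column requested above.
l_series.name = 'ID_list'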


In [247]: %timeit df[df.index.isin(l)]
1 loops, best of 3: 1.12 s per loop

In [248]: %timeit df[df.index.isin(l_series)]
1 loops, best of 3: 549 ms per loop

# index vs column doesn't make a difference here
In [304]: %timeit df[df.ID.isin(l_series)]
1 loops, best of 3: 541 ms per loop

In [305]: %timeit df[df.index.isin(l_series)]
1 loops, best of 3: 529 ms per loop

# query 'in' syntax has the same performance as 'isin'
In [249]: %timeit df.query('index in @l')
1 loops, best of 3: 1.14 s per loop

In [250]: %timeit df.query('index in @l_series')
1 loops, best of 3: 564 ms per loop

# ID must be the index for DataFrame.join and l_series must have a name.
# join defaults to a left join so we need to specify inner for existence.
In [251]: %timeit df.join(l_series, how='inner')
10 loops, best of 3: 93.3 ms per loop

# Smaller datasets.
df = pd.DataFrame([1,2,3,4], columns=['ID'])
l = list(range(10000))  # materialized as a list (range is lazy in Python 3)
l_dict = dict(zip(l, l))
l_series = pd.Series(l)
l_series.name = 'ID_list'


In [363]: %timeit df.join(l_series, how='inner')
1000 loops, best of 3: 733 µs per loop

In [291]: %timeit df[df.ID.isin(l_dict)]
1000 loops, best of 3: 742 µs per loop

In [292]: %timeit df[df.ID.isin(l)]
1000 loops, best of 3: 771 µs per loop

In [294]: %timeit df[df.ID.isin(l_series)]
100 loops, best of 3: 2 ms per loop

# It's actually faster to use apply or a list comprehension for these small cases.
In [296]: %timeit df[[x in l_dict for x in df.ID]]
1000 loops, best of 3: 203 µs per loop

In [299]: %timeit df[df.ID.apply(lambda x: x in l_dict)]
1000 loops, best of 3: 297 µs per loop
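
Since the timings above come from an old pandas version, it may be worth re-measuring on your own setup. The following is a rough sketch of such a comparison using the standard timeit module; the function name compare_subset_methods, the default row count, and the placeholder column in_list are assumptions for illustration, and the numbers it prints will vary by machine and pandas version.

```python
import timeit

import pandas as pd


def compare_subset_methods(n_rows=100_000, repeat=3):
    """Time a few ways of subsetting a frame by ID; returns seconds per method."""
    ids = list(range(n_rows))
    df = pd.DataFrame({"ID": ids})
    indexed = df.set_index("ID")  # join matches on the index
    lookup = pd.Series(True, index=ids, name="in_list")  # named, as join requires
    id_set = set(ids)

    candidates = {
        "isin": lambda: df[df.ID.isin(ids)],
        "join": lambda: indexed.join(lookup, how="inner"),
        "comprehension": lambda: df[[x in id_set for x in df.ID]],
    }
    # Keep the best of `repeat` single runs for each candidate.
    return {
        name: min(timeit.repeat(fn, number=1, repeat=repeat))
        for name, fn in candidates.items()
    }


timings = compare_subset_methods()
for name, seconds in sorted(timings.items(), key=lambda kv: kv[1]):
    print(f"{name:>13}: {seconds * 1e3:.2f} ms")
```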
