Using the in operator with a Pandas Series

Problem description

Why can't I match a string in a Pandas Series using in? In the following example, the first expression unexpectedly evaluates to False, but the second one works.

import pandas as pd

df = pd.DataFrame({'name': ['Adam', 'Ben', 'Chris']})
'Adam' in df['name']        # False (unexpected)
'Adam' in list(df['name'])  # True

Recommended answer

In the first case:

The in operator is interpreted as a call to df['name'].__contains__('Adam'). If you look at the implementation of __contains__ in pandas.Series, you will find the following (inherited from pandas.core.generic.NDFrame):

def __contains__(self, key):
    """True if the key is in the info axis"""
    return key in self._info_axis

So your first use of in is interpreted as:

'Adam' in df['name']._info_axis 

This evaluates to False, as expected, because df['name']._info_axis holds the index (here a RangeIndex), not the data itself:

In [37]: df['name']._info_axis 
Out[37]: RangeIndex(start=0, stop=3, step=1)

In [38]: list(df['name']._info_axis) 
Out[38]: [0, 1, 2]
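To make the index-versus-values distinction concrete, here is a small sketch. The Series named `labelled`, with names used as index labels, is a hypothetical addition that does not appear in the original question; it is only there to show that `in` succeeds when the key is an index label.

```python
import pandas as pd

s = pd.Series(['Adam', 'Ben', 'Chris'])

# `in` checks the index labels (0, 1, 2), not the stored values.
label_hit = 0 in s            # 0 is an index label
value_miss = 'Adam' in s      # 'Adam' is a value, not a label

# Hypothetical Series whose index holds the names instead.
labelled = pd.Series([30, 25, 41], index=['Adam', 'Ben', 'Chris'])
label_match = 'Adam' in labelled

# Testing against the underlying values checks the data itself.
value_hit = 'Adam' in s.values

print(label_hit, value_miss, label_match, value_hit)
# True False True True
```

In other words, a Series behaves more like a dict keyed by its index than like a list when used with `in`.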


In the second case:

'Adam' in list(df['name'])

Using list converts the pandas.Series to a list of its values. So the actual operation is this:

In [42]: list(df['name'])
Out[42]: ['Adam', 'Ben', 'Chris']

In [43]: 'Adam' in ['Adam', 'Ben', 'Chris']
Out[43]: True


Here are a few more idiomatic ways to do what you want (with their associated timings):

In [56]: df.name.str.contains('Adam').any()
Out[56]: True

In [57]: timeit df.name.str.contains('Adam').any()
The slowest run took 6.25 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 144 µs per loop

In [58]: df.name.isin(['Adam']).any()
Out[58]: True

In [59]: timeit df.name.isin(['Adam']).any()
The slowest run took 5.13 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 191 µs per loop

In [60]: df.name.eq('Adam').any()
Out[60]: True

In [61]: timeit df.name.eq('Adam').any()
10000 loops, best of 3: 178 µs per loop

Note: the last approach was also suggested by @Wen in a comment above.
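One caveat that may be worth checking before picking among these: the three approaches are not strictly equivalent. `str.contains` performs a substring search (treated as a regular expression by default), while `isin` and `eq` compare whole values. The sketch below illustrates this with the partial name 'Ada', which is an illustrative assumption and not part of the original question.

```python
import pandas as pd

df = pd.DataFrame({'name': ['Adam', 'Ben', 'Chris']})

# Exact matches: eq and isin compare entire values.
exact_eq = bool(df['name'].eq('Adam').any())
exact_isin = bool(df['name'].isin(['Adam']).any())

# str.contains matches substrings, so a partial name also hits.
partial_contains = bool(df['name'].str.contains('Ada').any())

# The same partial name does not match with an exact comparison.
partial_eq = bool(df['name'].eq('Ada').any())

print(exact_eq, exact_isin, partial_contains, partial_eq)
# True True True False
```

If exact membership is what you need, `eq` or `isin` is probably the safer choice; `str.contains` fits when partial matches are intended.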
