Use Pandas string method 'contains' on a Series containing lists of strings

Question

Given a simple Pandas Series of strings, where each string can consist of more than one sentence:

In:
import pandas as pd
s = pd.Series(['This is a long text. It has multiple sentences.','Do you see? More than one sentence!','This one has only one sentence though.'])

Out:
0    This is a long text. It has multiple sentences.
1                Do you see? More than one sentence!
2             This one has only one sentence though.
dtype: object

I use the pandas string method split with a regex pattern to split each row into its individual sentences (this produces unnecessary empty list elements; any suggestions on how to improve the regex?).

In:
s = s.str.split(r'([A-Z][^\.!?]*[\.!?])')

Out:
0    [, This is a long text.,  , It has multiple se...
1        [, Do you see?,  , More than one sentence!, ]
2         [, This one has only one sentence though., ]
dtype: object

This converts each row into a list of strings, with each sentence held in its own element.
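
On the side question about the empty list elements: they appear because split keeps the text between matches (here empty strings and single spaces) alongside the captured group. One possible way around this, sketched below and assuming every sentence starts with an uppercase letter and ends in '.', '!' or '?', is to collect the matches with str.findall instead of splitting on them. The variable name sentences is only illustrative.

```python
import pandas as pd

s = pd.Series([
    'This is a long text. It has multiple sentences.',
    'Do you see? More than one sentence!',
    'This one has only one sentence though.',
])

# findall returns only the substrings that match the pattern, so the
# empty strings and whitespace separators produced by split never appear.
# No capturing group is needed; inside [...] the dot needs no escaping.
sentences = s.str.findall(r'[A-Z][^.!?]*[.!?]')
```

Note that any text that does not match the pattern (for example a fragment before the first uppercase letter) is silently dropped by this approach.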

Now, my goal is to use the string method contains to check each element in each row separately against a specific regex pattern, and to create a new Series of boolean values, each indicating whether the regex matched at least one of the list elements in that row.

I would expect something like this:

In:
s.str.contains('you')

Out:
0   False
1   True
2   False

<-- Row 0 does not contain 'you' in any of its elements, row 1 does, and row 2 does not.

However, when I run the above, the result is

0   NaN
1   NaN
2   NaN
dtype: float64

I also tried a list comprehension, which does not work:

result = [[x.str.contains('you') for x in y] for y in s]
AttributeError: 'str' object has no attribute 'str'

Any suggestions on how this can be achieved?

Recommended answer

You can use Python's str.find() method, which returns -1 when the substring is not found:

>>> s.apply(lambda x: any(i.find('you') >= 0 for i in x))
0    False
1     True
2    False
dtype: bool
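
Since str.find() only does literal substring matching, while the question asks about regex patterns, a variant using the standard re module may be closer to the goal. This is a minimal sketch: the word-boundary pattern is a made-up example, and the Series is rebuilt by hand here so that the block runs on its own.

```python
import re
import pandas as pd

# Same structure as the split result from the question.
s = pd.Series([
    ['', 'This is a long text.', ' ', 'It has multiple sentences.', ''],
    ['', 'Do you see?', ' ', 'More than one sentence!', ''],
    ['', 'This one has only one sentence though.', ''],
])

# Hypothetical pattern for illustration: 'you' as a whole word.
pattern = re.compile(r'\byou\b')

# For each row (a list of strings), check whether any element matches.
result = s.apply(
    lambda row: any(pattern.search(x) is not None for x in row)
)
```

Compiling the pattern once avoids re-parsing it for every list element.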

I guess s.str.contains('you') does not work because the elements of your Series are lists, not strings, so the .str accessor returns NaN for each of them. You can also do something like this, which wraps each list in its own Series so that str.contains applies to the individual strings:

>>> s.apply(lambda x: any(pd.Series(x).str.contains('you')))
0    False
1     True
2    False
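
Building a new Series for every row can be slow on large data. One possible alternative, assuming a pandas version that provides Series.explode (0.25 or later), is to flatten the lists into one long Series, call str.contains once, and then reduce back to one value per original row. The Series below is rebuilt by hand so the sketch is self-contained.

```python
import pandas as pd

# Same structure as the split result from the question.
s = pd.Series([
    ['', 'This is a long text.', ' ', 'It has multiple sentences.', ''],
    ['', 'Do you see?', ' ', 'More than one sentence!', ''],
    ['', 'This one has only one sentence though.', ''],
])

# explode() gives one row per list element and repeats the original
# index label for every element that came from the same list.
exploded = s.explode()

# na=False treats missing values (e.g. from empty lists) as non-matches.
matches = exploded.str.contains('you', na=False)

# Group by the original index label and ask whether any element matched.
result = matches.groupby(level=0).any()
```

Here str.contains interprets its argument as a regex by default, so the same code works for patterns other than a plain substring.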
