Use Pandas string method 'contains' on a Series containing lists of strings
Question
Given a simple Pandas Series that contains some strings which can consist of more than one sentence:
In:
import pandas as pd
s = pd.Series(['This is a long text. It has multiple sentences.','Do you see? More than one sentence!','This one has only one sentence though.'])
Out:
0 This is a long text. It has multiple sentences.
1 Do you see? More than one sentence!
2 This one has only one sentence though.
dtype: object
I use the pandas string method split with a regex pattern to split each row into its individual sentences (this produces unnecessary empty list elements; any suggestions on how to improve the regex?).
In:
s = s.str.split(r'([A-Z][^\.!?]*[\.!?])')
Out:
0 [, This is a long text., , It has multiple se...
1 [, Do you see?, , More than one sentence!, ]
2 [, This one has only one sentence though., ]
dtype: object
This converts each row into a list of strings, with each element holding one sentence.
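On the side question about the empty list elements: they appear because split is given a pattern with a capturing group, so the matched sentences are kept as separators and the (empty or whitespace-only) text between them is kept as well. A minimal sketch of one possible alternative, assuming only the sentences themselves are needed, is to collect the matches with str.findall instead of splitting. The variable names raw and sentences are illustrative and not part of the original post.

```python
import pandas as pd

# Same example data as in the question (named `raw` here for illustration)
raw = pd.Series([
    'This is a long text. It has multiple sentences.',
    'Do you see? More than one sentence!',
    'This one has only one sentence though.',
])

# findall returns only the matched sentences, so no empty strings are produced.
# Inside a character class, '.', '!' and '?' do not need escaping.
sentences = raw.str.findall(r'[A-Z][^.!?]*[.!?]')

print(sentences.tolist())
```

Like the original pattern, this assumes every sentence starts with an uppercase ASCII letter and ends with '.', '!' or '?'; text that does not fit this shape is silently dropped.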
Now, my goal is to use the string method contains to check each element in each row separately against a specific regex pattern, and to create a new Series of boolean values, each indicating whether the regex matched at least one of the list elements.
I would expect something like this:
In:
s.str.contains('you')
Out:
0 False
1 True
2 False
Row 0 does not contain 'you' in any of its elements, but row 1 does, while row 2 does not.
However, when doing the above, the result is:
0 NaN
1 NaN
2 NaN
dtype: float64
I also tried a list comprehension, which does not work:
result = [[x.str.contains('you') for x in y] for y in s]
AttributeError: 'str' object has no attribute 'str'
Any suggestions on how this can be achieved?
Recommended answer
You can use Python's str.find() method:
>>> s.apply(lambda x: any(i.find('you') >= 0 for i in x))
0 False
1 True
2 False
dtype: bool
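Note that find() performs a literal substring search, whereas the question asks about matching a regex pattern. A minimal sketch of the same idea with the standard re module might look like this. The list contents are written out by hand to mirror the split output from the question, and the word-boundary pattern is only an illustrative assumption.

```python
import re
import pandas as pd

# Lists of sentences, written out by hand to mirror the split output above
s = pd.Series([
    ['', 'This is a long text.', ' ', 'It has multiple sentences.', ''],
    ['', 'Do you see?', ' ', 'More than one sentence!', ''],
    ['', 'This one has only one sentence though.', ''],
])

# Illustrative pattern (an assumption): 'you' as a whole word
pattern = re.compile(r'\byou\b')

# True for a row if the pattern matches at least one element of its list
result = s.apply(lambda items: any(pattern.search(item) for item in items))

print(result)
```

Compiling the pattern once outside the lambda avoids repeating the pattern lookup for every list element.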
I guess s.str.contains('you') is not working because the elements of your Series are not strings but lists. But you can also do something like this:
>>> s.apply(lambda x: any(pd.Series(x).str.contains('you')))
0 False
1 True
2 False
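Building a temporary Series for every row can be slow on large data. As a possible alternative, assuming a pandas version that provides Series.explode (0.25 or later), the lists can be flattened once, matched with str.contains, and then reduced back to one value per original row. This is a sketch, not part of the original answer, and the list contents are written out by hand to mirror the split output from the question.

```python
import pandas as pd

# Lists of sentences, written out by hand to mirror the split output above
s = pd.Series([
    ['', 'This is a long text.', ' ', 'It has multiple sentences.', ''],
    ['', 'Do you see?', ' ', 'More than one sentence!', ''],
    ['', 'This one has only one sentence though.', ''],
])

# One row per list element; the original index label is repeated
flat = s.explode()

# Vectorized regex match on plain strings; na=False treats missing values
# (for example, from empty lists) as "no match"
matches = flat.str.contains('you', regex=True, na=False)

# Collapse back to one boolean per original row
result = matches.groupby(level=0).any()

print(result)
```

Here str.contains works as intended because, after explode, every element is a string rather than a list.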