Scalable solution for str.contains with list of strings in pandas
Problem description
I am parsing a pandas dataframe df1 containing string object rows. I have a reference list of keywords and need to delete every row in df1 containing any word from the reference list.
Currently, I do it like this:
reference_list = ["words", "to", "remove"]
df1 = df1[~df1[0].str.contains(r"words")]
df1 = df1[~df1[0].str.contains(r"to")]
df1 = df1[~df1[0].str.contains(r"remove")]
This does not scale to thousands of words. However, when I do:
df1 = df1[~df1[0].str.contains(reference_word for reference_word in reference_list)]
I get the error: first argument must be string or compiled pattern.
Following this solution, I tried:
reference_list = "words|to|remove"
df1 = df1[~df1[0].str.contains(reference_list)]
This doesn't raise an exception, but it doesn't parse all the words either.
How can I efficiently use str.contains with a list of words?
Recommended answer
For a scalable solution, do the following:
- join the words with the regex OR pipe |
- pass the resulting pattern to str.contains
- use the result to filter df1
To index the 0th column, don't use df1[0] (as this might be considered ambiguous: 0 is looked up as a column label, not as a position). It would be better to use loc or iloc (see below).
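To illustrate why df1[0] can be ambiguous, here is a minimal sketch. The frame `df` and its column name "text" are made up for this example and are not from the question.

```python
import pandas as pd

# Hypothetical frame whose only column is labelled "text", not 0.
df = pd.DataFrame({"text": ["some words here", "nothing special"]})

# Positional access: always the first column, whatever its label is.
first_col = df.iloc[:, 0]

# Label access: df[0] looks for a column *named* 0, which does not exist here.
try:
    df[0]
    label_lookup_ok = True
except KeyError:
    label_lookup_ok = False

print(first_col.tolist())
print(label_lookup_ok)
```

If the column really is labelled 0, df1[0] and df1.iloc[:, 0] return the same data; iloc simply makes the "first column" intent explicit.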
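words = ["words", "to", "remove"]
# Match any listed word as a whole word (\b marks a word boundary).
mask = df1.iloc[:, 0].str.contains(r'\b(?:{})\b'.format('|'.join(words)))
df1 = df1[~mask]  # keep only the rows that do not match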
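For a version that can be run as-is, here is a rough sketch with sample data. The rows in df1 are invented for illustration. Two things are additions that the answer itself does not use: re.escape, on the assumption that some keywords might contain regex metacharacters, and na=False, on the assumption that the column might contain missing values.

```python
import re
import pandas as pd

# Invented sample data; the real df1 comes from the question.
df1 = pd.DataFrame({0: [
    "some words here",
    "keep this row",
    "going to town",
    "tomorrow is fine",
    None,
]})
words = ["words", "to", "remove"]

# Escape each keyword (assumption: keywords may contain regex metacharacters),
# then join them with | inside a non-capturing group bounded by \b.
pattern = r"\b(?:{})\b".format("|".join(re.escape(w) for w in words))

# na=False (assumption: missing values should count as "no match").
mask = df1.iloc[:, 0].str.contains(pattern, na=False)
filtered = df1[~mask]

print(filtered)
```

Because of the \b word boundaries, "tomorrow is fine" is kept even though it contains the letters "to", while "going to town" is dropped.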
Note: The str.contains approach will also work if words is a Series.
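A small sketch of the Series case, with invented sample rows: str.join accepts any iterable of strings, and iterating over a Series yields its values, so the pattern is built the same way.

```python
import pandas as pd

# Keywords held in a Series instead of a list (sample values from the question).
words = pd.Series(["words", "to", "remove"])

# "|".join iterates over the Series values, exactly as it would over a list.
pattern = r"\b(?:{})\b".format("|".join(words))

# Invented sample rows for illustration.
df1 = pd.DataFrame({0: ["words to live by", "plain text"]})
mask = df1.iloc[:, 0].str.contains(pattern)
result = df1[~mask]

print(pattern)
print(result.iloc[:, 0].tolist())
```

This assumes every entry of the Series is a string; a missing value in words would make the join fail.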
Alternatively, if your 0th column is a column of words only (not sentences), then you can use isin (Series.isin here, since a single column is selected), which should be faster:
df1 = df1[~df1.iloc[:, 0].isin(words)]
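The difference between isin and str.contains can be seen in a short sketch. The frames `single` and `sentences` are invented for this example. isin tests for exact equality with one of the keywords, so it only helps when each cell holds a single word.

```python
import pandas as pd

words = ["words", "to", "remove"]

# Column of single words: isin drops the exact matches only.
single = pd.DataFrame({0: ["words", "keep", "to", "tomato"]})
kept = single[~single.iloc[:, 0].isin(words)]

# Column of sentences: no cell equals a keyword, so isin matches nothing.
sentences = pd.DataFrame({0: ["some words here", "keep"]})
isin_mask = sentences.iloc[:, 0].isin(words)

print(kept.iloc[:, 0].tolist())
print(isin_mask.tolist())
```

For sentence-like cells, the str.contains pattern remains the appropriate tool.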