Scalable solution for str.contains with list of strings in pandas

Problem description

I am parsing a pandas dataframe df1 containing string object rows. I have a reference list of keywords and need to delete every row in df1 containing any word from the reference list.

Currently, I do it like this:

reference_list = ["words", "to", "remove"]
df1 = df1[~df1[0].str.contains(r"words")]
df1 = df1[~df1[0].str.contains(r"to")]
df1 = df1[~df1[0].str.contains(r"remove")]

This does not scale to thousands of words. However, when I do this:

df1 = df1[~df1[0].str.contains(reference_word for reference_word in reference_list)]

I get the error: first argument must be string or compiled pattern.

Following this solution, I tried:

reference_list = "words|to|remove"
df1 = df1[~df1[0].str.contains(reference_list)]

This doesn't raise an exception, but it doesn't match all the words either.

How can I use str.contains efficiently with a list of words?

Recommended answer

For a scalable solution, do the following -

  1. join the contents of words by the regex OR pipe |
  2. pass this to str.contains
  3. use the result to filter df1

To index the 0th column, don't use df1[0] (as this might be considered ambiguous). It would be better to use loc or iloc (see below).

words = ["words", "to", "remove"]
mask = df1.iloc[:, 0].str.contains(r'\b(?:{})\b'.format('|'.join(words)))
df1 = df1[~mask]
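
The steps above can be sketched end to end as follows. The sample frame is made up for illustration, since the post does not show the contents of df1. The use of re.escape and na=False is an addition that is not part of the original answer: the first guards against keywords that contain regex metacharacters, the second treats missing values as non-matches.

```python
import re

import pandas as pd

# Hypothetical sample data; the original post does not show df1's contents.
df1 = pd.DataFrame(
    {0: ["some words here", "tomato soup", "please remove me", "keep this", None]}
)

words = ["words", "to", "remove"]

# Escape each keyword so characters such as "." or "+" are matched literally,
# then join with the regex OR pipe and wrap the group in word boundaries.
pattern = r"\b(?:{})\b".format("|".join(re.escape(w) for w in words))

# na=False makes missing values count as "no match", so the mask stays boolean.
mask = df1.iloc[:, 0].str.contains(pattern, na=False)
filtered = df1[~mask]

print(filtered)
```

In this sketch "tomato soup" is kept, because \b requires "to" to appear as a whole word. Without the word boundaries, a plain "words|to|remove" pattern would also match "to" inside "tomato".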

Note: This will also work if words is a Series.
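
As a small illustration of the note above: iterating over a Series yields its values, so '|'.join accepts it directly, provided every value is a string. The keyword Series and the sample frame here are hypothetical.

```python
import pandas as pd

# Hypothetical keyword Series, e.g. loaded from a column of another frame.
words = pd.Series(["words", "to", "remove"])

# str.join iterates over the Series values, so no conversion to a list is needed.
pattern = r"\b(?:{})\b".format("|".join(words))

# Hypothetical data to filter.
df1 = pd.DataFrame({0: ["words to drop", "nothing here"]})
mask = df1.iloc[:, 0].str.contains(pattern)
df1 = df1[~mask]

print(df1)
```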

Alternatively, if your 0th column is a column of words only (not sentences), then you can use isin (here Series.isin on the selected column), which tests for exact matches instead of running a regex and should be faster -

df1 = df1[~df1.iloc[:, 0].isin(words)]
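
A minimal sketch of the isin variant, assuming a hypothetical column in which every row holds a single word. Because isin compares whole values, it only suits this single-word case and would not find keywords inside longer sentences.

```python
import pandas as pd

# Hypothetical single-word column.
df1 = pd.DataFrame({0: ["words", "tomato", "keep", "remove"]})
words = ["words", "to", "remove"]

# isin tests whole-value equality, so "tomato" is unaffected by the keyword "to".
kept = df1[~df1.iloc[:, 0].isin(words)]

print(kept)
```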
