Find any word of a list in the column of dataframe

Problem description

I have a list of words, negative, that has 4783 elements. I want to use the following code:

tweets3 = tweets2[tweets2['full_text'].str.contains('|'.join(negative))]

But it gives an error like this: error: multiple repeat at position 4193.

I do not understand this error. Apparently, if I use a single word in str.contains such as str.contains("deal") I am able to get results.

All I need is a new dataframe that contains only the rows of tweets2 whose full_text column contains any of the words in negative.

Optionally, I would also like to see whether I can have a boolean column that marks present and absent values as 1 or 0.

I arrived at using the following code with the help of @wp78de:

tweets2['negative'] = tweets2.loc[tweets2['full_text'].str.contains(r'(?:{})'.format('|'.join(negative)), regex=True, na=False)].copy()

Recommended answer

For arbitrary literal strings that may contain regular expression metacharacters, you can use the re.escape() function. Something along these lines should be sufficient:

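import re
# Escape each word on its own, then join with '|' so the alternation itself is not escaped.
tweets3 = tweets2[tweets2['full_text'].str.contains('|'.join(map(re.escape, negative)), regex=True, na=False)]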
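
To make the idea concrete, here is a minimal sketch that covers both the filtered dataframe and the optional 0/1 column. The sample rows and the short negative list are made-up stand-ins for the asker's data, and the variable names pattern and mask are only illustrative. The key point is that entries containing characters such as *, + or ( are otherwise read as regex syntax, so each word is escaped individually before the words are joined with |.

```python
import re
import pandas as pd

# Hypothetical sample data standing in for the asker's tweets2 and negative.
tweets2 = pd.DataFrame({
    "full_text": ["what a bad deal", "great day", None, "so sad :(", "c++ again"],
})
negative = ["bad", "c++", ":(", "awful"]

# Escape every word separately, then join with "|" so that the
# alternation operator is kept while metacharacters become literals.
pattern = "|".join(re.escape(word) for word in negative)

# Boolean mask: True where full_text contains any of the words.
# na=False makes rows with missing text count as "no match".
mask = tweets2["full_text"].str.contains(pattern, regex=True, na=False)

# New dataframe with only the matching rows.
tweets3 = tweets2[mask].copy()

# Optional indicator column: 1 if a word is present, 0 otherwise.
tweets2["negative"] = mask.astype(int)
```

Note that str.contains performs a substring search, so a word such as "bad" would also match inside a longer word; whether that is acceptable depends on the data.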
