Stopword removal with pandas

Question

I would like to remove stopwords from a column of a data frame. The column contains text that needs to be split into words.

For example, my data frame looks like this:

ID   Text
1    eat launch with me
2    go outside have fun

I want to apply stopword removal to the Text column, so the text needs to be split first.

I tried this:

for item in cached_stop_words:
    if item in df_from_each_file[['text']]:
        print(item)
        df_from_each_file['text'] = df_from_each_file['text'].replace(item, '')

My output should look like this:

ID   Text
1    eat launch 
2    go fun

That is, the stopwords have been removed. But my code does not work correctly. I also tried the other way around, converting my data frame to a Series and looping through it, but that did not work either.

Thanks for your help.

Recommended answer

replace (by itself) isn't a good fit here: by default it only matches whole cell values, whereas you want partial string replacement. You want regex-based replacement.
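To see the difference, here is a minimal sketch on a toy Series (the data and the stop word "with" are made up for illustration). Without regex=True, Series.replace only swaps cells whose entire value equals the target, and an `in` test on a DataFrame checks column labels rather than cell contents, which is likely why the loop in the question never replaces anything.

```python
import pandas as pd

# Toy data, made up for illustration.
df = pd.DataFrame({"Text": ["eat launch with me", "with"]})

# `in` on a DataFrame tests column labels, not cell values.
label_hit = "with" in df[["Text"]]

# Series.replace matches whole cell values by default,
# so only the second row (exactly "with") changes.
whole = df["Text"].replace("with", "")

# With regex=True the pattern is applied inside each string.
partial = df["Text"].replace(r"\bwith\b", "", regex=True)

print(label_hit)
print(whole.tolist())
print(partial.tolist())
```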

One simple solution, when you have a manageable number of stop words, is to use str.replace.

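import re
# \b word boundaries keep stop words from matching inside longer words.
p = re.compile(r"\b(?:{})\b".format('|'.join(map(re.escape, cached_stop_words))))
df['Text'] = df['Text'].str.lower().str.replace(p, '', regex=True)  # pandas 2.0+ defaults to regex=False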

df
   ID               Text
0   1       eat launch  
1   2   outside have fun
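The regex approach leaves behind the spaces that surrounded each removed word, as the padding in the output above suggests. One way to tidy that up, sketched here with a made-up stop word list standing in for cached_stop_words, is to collapse runs of whitespace and strip the ends:

```python
import re
import pandas as pd

# Hypothetical stop words; substitute your own cached_stop_words.
stop_words = ["with", "me", "go"]
df = pd.DataFrame({"ID": [1, 2],
                   "Text": ["eat launch with me", "go outside have fun"]})

pattern = r"\b(?:{})\b".format("|".join(map(re.escape, stop_words)))
cleaned = (
    df["Text"].str.lower()
    .str.replace(pattern, "", regex=True)   # drop the stop words
    .str.replace(r"\s+", " ", regex=True)   # collapse repeated whitespace
    .str.strip()                            # trim leading/trailing spaces
)
print(cleaned.tolist())
```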

If performance is important, use a list comprehension.

cached_stop_words = set(cached_stop_words)
df['Text'] = [' '.join([w for w in x.lower().split() if w not in cached_stop_words]) 
    for x in df['Text'].tolist()]

df
   ID              Text
0   1        eat launch
1   2  outside have fun
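For reference, a self-contained sketch of the list comprehension approach follows. The helper name remove_stopwords and the stop word list are illustrative assumptions, chosen here to reproduce the output the question asks for. Because the text is split and re-joined, no stray spaces are left behind.

```python
import pandas as pd

def remove_stopwords(texts, stop_words):
    """Return a list of strings with stop words dropped (illustrative helper)."""
    stop_set = set(stop_words)  # set membership tests are fast
    return [
        " ".join(w for w in text.lower().split() if w not in stop_set)
        for text in texts
    ]

# Hypothetical stop words, not the asker's actual list.
stop_words = ["with", "me", "outside", "have"]
df = pd.DataFrame({"ID": [1, 2],
                   "Text": ["eat launch with me", "go outside have fun"]})
df["Text"] = remove_stopwords(df["Text"].tolist(), stop_words)
print(df["Text"].tolist())
```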
