Removing stop words using spaCy
Question
I am cleaning a column in my data frame, Sumcription, and am trying to do 3 things:

- Tokenize
- Lemmatize
- Remove stop words
import spacy
nlp = spacy.load('en_core_web_sm', parser=False, entity=False)
df['Tokens'] = df.Sumcription.apply(lambda x: nlp(x))
spacy_stopwords = spacy.lang.en.stop_words.STOP_WORDS
spacy_stopwords.add('attach')
df['Lema_Token'] = df.Tokens.apply(lambda x: " ".join([token.lemma_ for token in x if token not in spacy_stopwords]))
However, when I print, for example:
df.Lema_Token.iloc[8]
the output still has the word attach in it:

attach poster on the wall because it is cool
Why doesn't it remove the stop word?

I also tried it like this:
df['Lema_Token_Test'] = df.Tokens.apply(lambda x: [token.lemma_ for token in x if token not in spacy_stopwords])
But the string attach still appears.
Answer
import spacy
import pandas as pd

# Load the spaCy model, disabling pipeline components we don't need
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

# New stop words list
customize_stop_words = [
    'attach'
]

# Mark them as stop words
for w in customize_stop_words:
    nlp.vocab[w].is_stop = True

# Test data
df = pd.DataFrame({'Sumcription': ["attach poster on the wall because it is cool",
                                   "eating and sleeping"]})

# Convert each row into a spaCy document and keep the lemma of each token
# that is not a stop word, then join the lemmas back into a string
df['Sumcription_lema'] = df.Sumcription.apply(
    lambda text: " ".join(token.lemma_ for token in nlp(text)
                          if not token.is_stop))
print(df)
Output:
Sumcription Sumcription_lema
0 attach poster on the wall because it is cool poster wall cool
1 eating and sleeping eat sleep
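The key idea in the answer, flipping `is_stop` on the vocab entry, still works in current spaCy. As a hedged sketch of the same approach that runs without a model download, the example below uses `spacy.blank('en')`; a blank pipeline has no lemmatizer, so it keeps `token.text` instead of `token.lemma_` (with a downloaded model you would load `en_core_web_sm` and use the lemma exactly as in the answer):

```python
import spacy
import pandas as pd

# spacy.blank("en") gives a tokenizer-only English pipeline; spaCy's default
# English stop-word list is still attached to its vocab
nlp = spacy.blank("en")

# Mark extra words as stop words on the vocab, as in the answer above
for w in ["attach"]:
    nlp.vocab[w].is_stop = True

df = pd.DataFrame({"Sumcription": ["attach poster on the wall because it is cool",
                                   "eating and sleeping"]})

# Keep the text of every non-stop-word token (no lemmas in a blank pipeline)
df["Sumcription_clean"] = df.Sumcription.apply(
    lambda text: " ".join(t.text for t in nlp(text) if not t.is_stop))
print(df.Sumcription_clean.tolist())  # ['poster wall cool', 'eating sleeping']
```

Note that row 1 keeps the inflected forms ("eating", "sleeping") because only a full model's lemmatizer would reduce them to "eat" and "sleep".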