Removing stop words using spaCy
Question
I am cleaning a column in my data frame, Sumcription, and am trying to do 3 things:

- Tokenize
- Lemmatize
- Remove stop words
import spacy
nlp = spacy.load('en_core_web_sm', parser=False, entity=False)
df['Tokens'] = df.Sumcription.apply(lambda x: nlp(x))
spacy_stopwords = spacy.lang.en.stop_words.STOP_WORDS
spacy_stopwords.add('attach')
df['Lema_Token'] = df.Tokens.apply(lambda x: " ".join([token.lemma_ for token in x if token not in spacy_stopwords]))
However, when I print, for example:
df.Lema_Token.iloc[8]
the output still has the word attach in it:

attach poster on the wall because it is cool
Why doesn't it remove the stop word?

I also tried it like this:
df['Lema_Token_Test'] = df.Tokens.apply(lambda x: [token.lemma_ for token in x if token not in spacy_stopwords])
But the string attach still appears.
Answer
import spacy
import pandas as pd

# Load the spaCy model, disabling pipeline components we don't need
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

# New stop words list
customize_stop_words = [
    'attach'
]

# Mark them as stop words
for w in customize_stop_words:
    nlp.vocab[w].is_stop = True

# Test data
df = pd.DataFrame({'Sumcription': ["attach poster on the wall because it is cool",
                                   "eating and sleeping"]})

# Convert each row into a spaCy document and keep the lemma of each token
# that is not a stop word, then join the lemmas back into a string
df['Sumcription_lema'] = df.Sumcription.apply(
    lambda text: " ".join(token.lemma_ for token in nlp(text)
                          if not token.is_stop))
print(df)
Output:
Sumcription Sumcription_lema
0 attach poster on the wall because it is cool poster wall cool
1 eating and sleeping eat sleep
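The key idea in the answer, flipping `is_stop` on the vocab entry, still works in current spaCy. As a hedged sketch of the same approach that runs without a model download, the example below uses `spacy.blank('en')`; a blank pipeline has no lemmatizer, so it keeps `token.text` instead of `token.lemma_` (with a downloaded model you would load `en_core_web_sm` and use the lemma exactly as in the answer):

```python
import spacy
import pandas as pd

# spacy.blank("en") gives a tokenizer-only English pipeline; spaCy's default
# English stop-word list is still attached to its vocab
nlp = spacy.blank("en")

# Mark extra words as stop words on the vocab, as in the answer above
for w in ["attach"]:
    nlp.vocab[w].is_stop = True

df = pd.DataFrame({"Sumcription": ["attach poster on the wall because it is cool",
                                   "eating and sleeping"]})

# Keep the text of every non-stop-word token (no lemmas in a blank pipeline)
df["Sumcription_clean"] = df.Sumcription.apply(
    lambda text: " ".join(t.text for t in nlp(text) if not t.is_stop))
print(df.Sumcription_clean.tolist())  # ['poster wall cool', 'eating sleeping']
```

Note that row 1 keeps the inflected forms ("eating", "sleeping") because only a full model's lemmatizer would reduce them to "eat" and "sleep".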