Why is the number of stems output by the NLTK stemmer different from the expected output?
Problem description
I have to perform stemming on a text. The task is as follows:
- Tokenize all the words given in tc. A word should contain letters, digits, or underscores. Store the tokenized list of words in tw.
- Convert all the words to lowercase. Store the result in the variable tw.
- Remove all the stop words from the unique set of tw. Store the result in the variable fw.
- Stem each word present in fw with PorterStemmer, and store the result in the list psw.
Below is my code:
import re
import nltk
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, LancasterStemmer

# Tokenize on runs of letters, digits, and underscores
pattern = r'\w+'
tw = nltk.regexp_tokenize(tc, pattern)
# Lowercase every token
tw = [word.lower() for word in tw]
# Drop English stop words
stop_word = set(stopwords.words('english'))
fw = [w for w in tw if w not in stop_word]
# print(sorted(filteredwords))
# Stem the remaining words with the Porter stemmer
porter = PorterStemmer()
psw = [porter.stem(word) for word in fw]
print(sorted(psw))
My code works with all the other test cases provided in the hands-on exercise, but it fails for the test case below, where
tc = "I inadvertently went to See's Candy last week (I was in the mall looking for phone repair), and as it turns out, See's Candy now charges a dollar -- a full dollar -- for even the simplest of their wee confection offerings. I bought two chocolate lollipops and two chocolate-caramel-almond things. The total cost was four-something. I mean, the candies were tasty and all, but let's be real: A Snickers bar is fifty cents. After this dollar-per-candy revelation, I may not find myself wandering dreamily back into a See's Candy any time soon."
My output is:
['almond', 'back', 'bar', 'bought', 'candi', 'candi', 'caramel', 'cent', 'charg', 'chocol', 'confect', 'cost', 'dollar', 'dreamili', 'even', 'fifti', 'find', 'four', 'full', 'inadvert', 'last', 'let', 'lollipop', 'look', 'mall', 'may', 'mean', 'offer', 'per', 'phone', 'real', 'repair', 'revel', 'see', 'simplest', 'snicker', 'someth', 'soon', 'tasti', 'thing', 'time', 'total', 'turn', 'two', 'wander', 'wee', 'week', 'went']
The expected output is:
['almond', 'back', 'bar', 'bought', 'candi', 'candi', 'candi', 'caramel', 'cent', 'charg', 'chocol', 'confect', 'cost', 'dollar', 'dreamili', 'even', 'fifti', 'find', 'four', 'full', 'inadvert', 'last', 'let', 'lollipop', 'look', 'mall', 'may', 'mean', 'offer', 'per', 'phone', 'real', 'repair', 'revel', 'see', 'simplest', 'snicker', 'someth', 'soon', 'tasti', 'thing', 'time', 'total', 'turn', 'two', 'wander', 'wee', 'week', 'went']
The difference is the number of occurrences of 'candi': two in my output, three in the expected output.
I am looking for help resolving this issue.
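The claim that the two lists differ only in 'candi' can be checked mechanically. The following is a minimal sketch using collections.Counter from the standard library; the names actual and expected are introduced here for illustration and simply hold the two lists quoted above.

```python
from collections import Counter

# The asker's output, copied from the question.
actual = ['almond', 'back', 'bar', 'bought', 'candi', 'candi', 'caramel',
          'cent', 'charg', 'chocol', 'confect', 'cost', 'dollar', 'dreamili',
          'even', 'fifti', 'find', 'four', 'full', 'inadvert', 'last', 'let',
          'lollipop', 'look', 'mall', 'may', 'mean', 'offer', 'per', 'phone',
          'real', 'repair', 'revel', 'see', 'simplest', 'snicker', 'someth',
          'soon', 'tasti', 'thing', 'time', 'total', 'turn', 'two', 'wander',
          'wee', 'week', 'went']

# The expected output, copied from the question.
expected = ['almond', 'back', 'bar', 'bought', 'candi', 'candi', 'candi',
            'caramel', 'cent', 'charg', 'chocol', 'confect', 'cost', 'dollar',
            'dreamili', 'even', 'fifti', 'find', 'four', 'full', 'inadvert',
            'last', 'let', 'lollipop', 'look', 'mall', 'may', 'mean', 'offer',
            'per', 'phone', 'real', 'repair', 'revel', 'see', 'simplest',
            'snicker', 'someth', 'soon', 'tasti', 'thing', 'time', 'total',
            'turn', 'two', 'wander', 'wee', 'week', 'went']

# Counter subtraction keeps only positive counts, so each direction
# shows what one list has that the other lacks.
missing = Counter(expected) - Counter(actual)  # in expected, not in actual
extra = Counter(actual) - Counter(expected)    # in actual, not in expected

print(dict(missing))  # {'candi': 1}
print(dict(extra))    # {}
```

A multiset comparison is used rather than a set comparison because the discrepancy is in how often a stem appears, not in which stems appear.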
Recommended answer
It is because the word "Candy" appears in both title case and lowercase in the text. If every token is lowercased before the unique set is taken, "Candy" and "candy" collapse into a single entry, so 'candi' appears one time fewer than in the expected output:
I inadvertently went to See's Candy last week (I was in the mall looking for phone repair), and as it turns out, See's Candy now charges a dollar -- a full dollar -- for even the simplest of their wee confection offerings. I bought two chocolate lollipops and two chocolate-caramel-almond things. The total cost was four-something. I mean, the candies were tasty and all, but let's be real: A Snickers bar is fifty cents. After this dollar-per-candy revelation, I may not find myself wandering dreamily back into a See's Candy any time soon.
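The effect of the ordering can be seen without NLTK at all. The sketch below is an illustration under stated assumptions, not the grader's actual reference solution: it uses a shortened excerpt of the test text, re.findall with the same \w+ pattern in place of nltk.regexp_tokenize, and a tiny hypothetical stop-word set standing in for stopwords.words('english').

```python
import re

# Shortened excerpt of the failing test case (assumption: not the full text).
tc = ("See's Candy now charges a dollar. The candies were tasty. "
      "After this dollar-per-candy revelation, I may not go back to See's Candy.")

# Hypothetical, tiny stop-word set standing in for stopwords.words('english').
stop_words = {"s", "now", "a", "the", "were", "after", "this",
              "i", "may", "not", "to"}

# Same \w+ pattern as in the question.
tokens = re.findall(r"\w+", tc)

# Variant A: lowercase first, then take the unique set.
# "Candy" and "candy" merge into one entry.
lower_first = sorted({t.lower() for t in tokens} - stop_words)

# Variant B: take the unique set first (case-sensitive), and lowercase
# only for the stop-word comparison. "Candy" and "candy" stay distinct.
unique_first = sorted(t for t in set(tokens) if t.lower() not in stop_words)

candy_a = [t for t in lower_first if t.startswith("cand")]
candy_b = [t for t in unique_first if t.lower().startswith("cand")]

print(candy_a)  # ['candies', 'candy']
print(candy_b)  # ['Candy', 'candies', 'candy']
```

Variant B leaves three candy-like tokens to stem, which matches the three 'candi' entries in the expected output, while stop words such as "The" and "I" are still removed regardless of case. Applied to the question's code, this would presumably mean building the unique set from the tokens before lowercasing them and comparing w.lower() against the stop-word list; as far as I know, PorterStemmer.stem lowercases its input by default, so the stems themselves still come out in lowercase.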