Why is the number of stems output by the NLTK Stemmer different from the expected output?

Problem description

I have to perform stemming on a text. The tasks are as follows:

  1. Tokenize all the words given in tc. The word should contain alphabets or numbers or underscore. Store the tokenized list of words in tw
  2. Convert all the words into lowercase. Store the result into the variable tw
  3. Remove all the stop words from the unique set of tw. Store the result into the variable fw
  4. Stem each word present in fw with PorterStemmer, and store the result in the list psw

Below is my code:

import re
import nltk
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, LancasterStemmer

# tc is the input text supplied by each test case
pattern = r'\w+'
tw = nltk.regexp_tokenize(tc, pattern)
tw = [word.lower() for word in tw]
stop_word = set(stopwords.words('english'))
fw = [w for w in tw if w not in stop_word]
# print(sorted(filteredwords))
porter = PorterStemmer()
psw = [porter.stem(word) for word in fw]
print(sorted(psw))

My code works with all the other test cases provided in the hands-on exercise, but it fails for the test case below, where

tc = "I inadvertently went to See's Candy last week (I was in the mall looking for phone repair), and as it turns out, See's Candy now charges a dollar -- a full dollar -- for even the simplest of their wee confection offerings. I bought two chocolate lollipops and two chocolate-caramel-almond things. The total cost was four-something. I mean, the candies were tasty and all, but let's be real: A Snickers bar is fifty cents. After this dollar-per-candy revelation, I may not find myself wandering dreamily back into a See's Candy any time soon."

My output is:

['almond', 'back', 'bar', 'bought', 'candi', 'candi', 'caramel', 'cent', 'charg', 'chocol', 'confect', 'cost', 'dollar', 'dreamili', 'even', 'fifti', 'find', 'four', 'full', 'inadvert', 'last', 'let', 'lollipop', 'look', 'mall', 'may', 'mean', 'offer', 'per', 'phone', 'real', 'repair', 'revel', 'see', 'simplest', 'snicker', 'someth', 'soon', 'tasti', 'thing', 'time', 'total', 'turn', 'two', 'wander', 'wee', 'week', 'went']

The expected output is:

['almond', 'back', 'bar', 'bought', 'candi', 'candi', 'candi', 'caramel', 'cent', 'charg', 'chocol', 'confect', 'cost', 'dollar', 'dreamili', 'even', 'fifti', 'find', 'four', 'full', 'inadvert', 'last', 'let', 'lollipop', 'look', 'mall', 'may', 'mean', 'offer', 'per', 'phone', 'real', 'repair', 'revel', 'see', 'simplest', 'snicker', 'someth', 'soon', 'tasti', 'thing', 'time', 'total', 'turn', 'two', 'wander', 'wee', 'week', 'went']

The difference is the number of occurrences of 'candi': my output has two, while the expected output has three.

I am looking for help to resolve this issue.

Recommended answer

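It is because the word "Candy" appears in both title case and lower case in the text. Tokens that differ only in case are distinct until they are lowercased, so "Candy" and "candy" collapse into a single entry if the unique set is taken after lowercasing. The expected output, with three occurrences of 'candi' (from "Candy", "candy", and "candies"), is consistent with the unique set being taken before the words are lowercased.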

I inadvertently went to See's Candy last week (I was in the mall looking for phone repair), and as it turns out, See's Candy now charges a dollar -- a full dollar -- for even the simplest of their wee confection offerings. I bought two chocolate lollipops and two chocolate-caramel-almond things. The total cost was four-something. I mean, the candies were tasty and all, but let's be real: A Snickers bar is fifty cents. After this dollar-per-candy revelation, I may not find myself wandering dreamily back into a See's Candy any time soon.
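
The effect of the title-case and lower-case forms can be seen in a small sketch. This is only an illustration, not the grader's actual implementation: it uses the standard library's re.findall in place of nltk.regexp_tokenize, a shortened sample sentence, and skips stop-word removal and stemming, so the variable names and sample text are assumptions made for the demonstration. It compares the two possible orders of "lowercase" and "take the unique set".

```python
import re

# Shortened, hypothetical sample containing both "Candy" and "candy".
text = ("See's Candy now charges a dollar. The candies were tasty. "
        "After this dollar-per-candy revelation, back to See's Candy.")

# Same pattern as in the question; re.findall stands in for regexp_tokenize.
tokens = re.findall(r"\w+", text)

# Order A: lowercase first, then take the unique set.
# "Candy" and "candy" become identical before deduplication.
lower_then_unique = sorted({t.lower() for t in tokens})

# Order B: take the unique set first (case-sensitive), then lowercase.
# "Candy" and "candy" survive as two entries that both read "candy".
unique_then_lower = sorted(t.lower() for t in set(tokens))

print(lower_then_unique.count("candy"))  # 1
print(unique_then_lower.count("candy"))  # 2
```

With order B, the two "candy" entries plus "candies" would each stem to 'candi' under PorterStemmer, which matches the three occurrences in the expected output. Applying the same order in the original code (deduplicate tw with set() before lowercasing, then filter stop words and stem) is one way to try reproducing it.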
