How to find the lemmas and frequency count of each word in list of sentences in a list?
Question
I want to find out the lemmas using WordNet Lemmatizer, and I also need to compute each word's frequency.
I am getting the following error. The traceback is as follows:

TypeError: unhashable type: 'list'
Note: The corpus is available in the nltk package itself.
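For context on the error message: WordNet's internal lookup tables are plain dicts, and dict keys must be hashable; a string is, a list is not. A minimal stdlib reproduction (the dict here is only a stand-in for WordNet's exception map, not the real nltk data structure):

# Strings hash fine as dict keys; lists raise the same TypeError
# seen in the traceback above.
table = {'worse': 'bad'}          # stand-in for a WordNet exception map
print('worse' in table)           # True -- a string key hashes fine
try:
    print(['worse'] in table)     # the membership test hashes the key...
except TypeError as err:
    print(err)                    # ...and fails: unhashable type: 'list'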
What I have tried so far is as follows:
import nltk, re
import string
from collections import Counter
from string import punctuation
from nltk.tokenize import TweetTokenizer, sent_tokenize, word_tokenize
from nltk.corpus import gutenberg, stopwords
from nltk.stem import WordNetLemmatizer
def remove_punctuation(from_text):
    table = str.maketrans('', '', string.punctuation)
    stripped = [w.translate(table) for w in from_text]
    return stripped

def preprocessing():
    raw_data = gutenberg.raw('shakespeare-hamlet.txt')
    tokens_sentences = sent_tokenize(raw_data)
    tokens = [[word.lower() for word in line.split()] for line in tokens_sentences]
    print(len(tokens))
    global stripped_tokens
    stripped_tokens = [remove_punctuation(i) for i in tokens]
    sw = stopwords.words('english')
    filter_set = [[token for token in sentence if (token.lower() not in sw and token.isalnum())] for sentence in stripped_tokens]
    lemma = WordNetLemmatizer()
    global lem
    lem = []
    for w in filter_set:
        lem.append(lemma.lemmatize(w))

preprocessing()
Please help me resolve this issue.
Answer
The problem is that lemma.lemmatize expects a string and you are passing a list. The elements of filter_set are lists. You need to change the line:
lem.append(lemma.lemmatize(w))
to this:
lem.append([wi for wi in map(lemma.lemmatize, w)])
The above code applies lemma.lemmatize to each token (wi) in w. Full code:
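Since lemma.lemmatize needs the nltk WordNet data to run, the per-token pattern can be sketched with a plain string method standing in for the lemmatizer:

# str.upper stands in for lemma.lemmatize here so the sketch runs
# without any nltk data; the map-per-token pattern is identical.
w = ['enter', 'barnardo', 'francisco']
out = [wi for wi in map(str.upper, w)]
print(out)  # ['ENTER', 'BARNARDO', 'FRANCISCO']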
import nltk, re
import string
from collections import Counter
from string import punctuation
from nltk.tokenize import TweetTokenizer, sent_tokenize, word_tokenize
from nltk.corpus import gutenberg, stopwords
from nltk.stem import WordNetLemmatizer
def remove_punctuation(from_text):
    table = str.maketrans('', '', string.punctuation)
    stripped = [w.translate(table) for w in from_text]
    return stripped

def preprocessing():
    raw_data = gutenberg.raw('shakespeare-hamlet.txt')
    tokens_sentences = sent_tokenize(raw_data)
    tokens = [[word.lower() for word in line.split()] for line in tokens_sentences]
    print(len(tokens))
    stripped_tokens = [remove_punctuation(i) for i in tokens]
    sw = stopwords.words('english')
    filter_set = [[token for token in sentence if (token.lower() not in sw and token.isalnum())]
                  for sentence in stripped_tokens]
    lemma = WordNetLemmatizer()
    lem = []
    for w in filter_set:
        lem.append([wi for wi in map(lemma.lemmatize, w)])
    return lem

result = preprocessing()
for e in result[:10]:  # take the first 10 results
    print(e)
Output
['tragedie', 'hamlet', 'william', 'shakespeare', '1599', 'actus', 'primus']
['scoena', 'prima']
['enter', 'barnardo', 'francisco', 'two', 'centinels']
['barnardo']
['who']
['fran']
['nay', 'answer', 'stand', 'vnfold', 'selfe', 'bar']
['long', 'liue', 'king', 'fran']
['barnardo']
['bar']
Update
To get the frequencies you can use Counter:
result = preprocessing()
frequencies = Counter(word for sentence in result for word in sentence)
for word, frequency in frequencies.most_common(10):  # get the 10 most frequent words
    print(word, frequency)
Output
ham 337
lord 217
king 180
haue 175
come 127
let 107
shall 107
hamlet 107
thou 105
good 98
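If the nltk data is not at hand, the flatten-and-count step can be verified on its own with a toy nested list standing in for the preprocessing() result:

from collections import Counter

# Toy stand-in for preprocessing(): a list of sentences,
# each already tokenized and lemmatized.
result = [['long', 'liue', 'king'], ['king', 'stand', 'king'], ['long', 'king']]

# The generator flattens the nested lists; Counter tallies each token.
frequencies = Counter(word for sentence in result for word in sentence)
print(frequencies.most_common(2))  # [('king', 4), ('long', 2)]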