How to convert token list into wordnet lemma list using nltk?

Problem Description

I have a list of tokens extracted from a PDF source. I am able to pre-process the text and tokenize it, but I want to loop through the tokens and convert each token in the list to its lemma in the wordnet corpus. So, my tokens list looks like this:

['0000', 'Everyone', 'age', 'remembers', 'Þ', 'rst', 'heard', 'contest', 'I', 'sitting', 'hideout', 'watching', ...]

There are no lemmas for words like 'Everyone', '0000', 'Þ' and many more, which I need to eliminate. But for words like 'age', 'remembers', 'heard' etc. the token list is supposed to look like:

['age', 'remember', 'hear', ...]

I am checking the synonyms through this code:

from nltk.corpus import wordnet as wn

syns = wn.synsets("heard")
print(syns[0].lemmas()[0].name())

At this point I have created the function clean_text() in Python for preprocessing. It looks like:

def clean_text(text):
    # Eliminating punctuations
    text = "".join([word for word in text if word not in string.punctuation])
    # tokenizing
    tokens = re.split("\W+", text)
    # lemmatizing and removing stopwords
    text = [wn.lemmatize(word) for word in tokens if word not in stopwords]
    # converting token list into synset
    syns = [text.lemmas()[0].name() for text in wn.synsets(text)]
    return text

I am getting the error:

syns = [text.lemmas()[0].name() for text in wn.synsets(text)]
AttributeError: 'list' object has no attribute 'lower'

How do I get the lemma list from the token list?

Full code:

import string
import re
from wordcloud import WordCloud
import nltk
from nltk.tokenize.treebank import TreebankWordDetokenizer
from nltk.corpus import wordnet
import PyPDF4
import matplotlib
import numpy as np
from PIL import Image

stopwords = nltk.corpus.stopwords.words('english')
moreStopwords = ['clin97803078874365pallr1indd'] # additional stopwords to be removed manually.
wn = nltk.WordNetLemmatizer()

data = PyPDF4.PdfFileReader(open('ReadyPlayerOne.pdf', 'rb'))
pageData = ''
for page in data.pages:
    pageData += page.extractText()
# print(pageData)


def clean_text(text):
    text = "".join([word for word in text if word not in string.punctuation])
    tokens = re.split("\W+", text)
    text = [wn.lemmatize(word) for word in tokens if word not in stopwords]
    syns = [text.lemmas()[0].name() for text in wordnet.synsets(text)]
    return syns


print(clean_text(pageData))

Recommended Answer

You are calling wordnet.synsets(text) with a list of words (check what text is at that point) when it should be called with a single word. The preprocessing inside wordnet.synsets tries to apply .lower() to its argument, hence the error (AttributeError: 'list' object has no attribute 'lower').
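
A minimal reproduction of the failure mode (assuming the wordnet corpus is already downloaded):

from nltk.corpus import wordnet

wordnet.synsets("heard")    # fine: a single string gets lowercased internally
wordnet.synsets(["heard"])  # AttributeError: 'list' object has no attribute 'lower'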

Below is a working version of clean_text with a fix for this problem:

import string
import re
import nltk
from nltk.corpus import wordnet

stopwords = nltk.corpus.stopwords.words('english')
wn = nltk.WordNetLemmatizer()

def clean_text(text):
    # Eliminate punctuation characters
    text = "".join([char for char in text if char not in string.punctuation])
    # Tokenize on runs of non-word characters
    tokens = re.split(r"\W+", text)
    # Lemmatize and remove stopwords
    text = [wn.lemmatize(word) for word in tokens if word not in stopwords]
    # Look up synsets one token at a time, never for the whole list at once
    lemmas = []
    for token in text:
        lemmas += [synset.lemmas()[0].name() for synset in wordnet.synsets(token)]
    return lemmas
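
(If the stopwords or wordnet corpora are not installed yet, run nltk.download('stopwords') and nltk.download('wordnet') once beforehand.)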


text = "The grass was greener."

print(clean_text(text))

which returns:

['grass', 'Grass', 'supergrass', 'eatage', 'pot', 'grass', 'grass', 'grass', 'grass', 'grass', 'denounce', 'green', 'green', 'green', 'green', 'fleeceable']
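
Note that wordnet.synsets returns every sense of every token, which is why a single word like 'green' produces several lemmas above. If the goal is instead one base form per token, with tokens that have no WordNet entry (such as 'Everyone', '0000' or 'Þ') dropped, a minimal sketch built on wordnet.morphy could look like this (tokens_to_lemmas is a hypothetical helper, not part of the answer above):

from nltk.corpus import wordnet

def tokens_to_lemmas(tokens):
    # wordnet.morphy returns the base form of a word,
    # or None when WordNet does not know the word at all
    lemmas = (wordnet.morphy(token.lower()) for token in tokens)
    return [lemma for lemma in lemmas if lemma is not None]

print(tokens_to_lemmas(['0000', 'Everyone', 'age', 'remembers', 'heard']))
# expected: ['age', 'remember', 'hear']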
