NLTK: lemmatizer and pos_tag


Question

I built a plaintext corpus, and the next step is to lemmatize all my texts. I'm using the WordNetLemmatizer and need the pos_tag for each token in order to avoid the problem that, e.g., loving -> lemma = loving and love -> lemma = love...

I think the default POS tag for the WordNetLemmatizer is n (= noun), but how can I use the output of pos_tag? I think the POS tags the WordNetLemmatizer expects are different from the ones pos_tag returns. Is there a function or something else that can help me?

I think word_pos is wrong in this line, and that is the cause of the error:

lemma = wordnet_lemmatizer.lemmatize(word,word_pos)

import nltk
from nltk.corpus import PlaintextCorpusReader
from nltk import sent_tokenize, word_tokenize, pos_tag
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

corpus_root = 'C:\\Users\\myname\\Desktop\\TestCorpus'
lyrics = PlaintextCorpusReader(corpus_root,'.*')

for fileid in lyrics.fileids():
     tokens = word_tokenize(lyrics.raw(fileid))
     tagged_tokens = pos_tag(tokens)
     for tagged_token in tagged_tokens:
         word = tagged_token[0]
         word_pos = tagged_token[1]
         print(tagged_token[0])
         print(tagged_token[1])
         lemma = wordnet_lemmatizer.lemmatize(word,pos=word_pos)
         print(lemma)


Additional question: Is pos_tag enough for my lemmatization, or do I need another tagger? My texts are lyrics...

Recommended answer

You need to convert the tag returned by pos_tag into one of the four "syntactic categories" that WordNet recognizes, then pass that to the lemmatizer as word_pos.

From the docs:

Syntactic category: n for noun files, v for verb files, a for adjective files, r for adverb files.
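
The conversion described above could be sketched roughly as follows. This is only an illustration, not part of NLTK: the helper names penn_to_wordnet and lemmatize_tagged are made up here, and the sketch assumes pos_tag returns Penn Treebank tags (JJ*, VB*, RB*, NN*, ...), falling back to n for everything else because that is the lemmatizer's default.

```python
# WordNet's four part-of-speech codes, as listed in the docs quoted above.
NOUN, VERB, ADJ, ADV = "n", "v", "a", "r"


def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS code (hypothetical helper)."""
    if tag.startswith("JJ"):   # JJ, JJR, JJS -> adjective
        return ADJ
    if tag.startswith("VB"):   # VB, VBD, VBG, VBN, VBP, VBZ -> verb
        return VERB
    if tag.startswith("RB"):   # RB, RBR, RBS -> adverb
        return ADV
    return NOUN                # assumption: treat every other tag as a noun


def lemmatize_tagged(tagged_tokens, lemmatizer):
    """Lemmatize (word, penn_tag) pairs.

    `lemmatizer` is any object with a lemmatize(word, pos=...) method,
    for example the wordnet_lemmatizer instance from the question.
    """
    return [
        lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))
        for word, tag in tagged_tokens
    ]
```

In the loop from the question, this would amount to replacing the failing call with lemma = wordnet_lemmatizer.lemmatize(word, pos=penn_to_wordnet(word_pos)), or calling lemmatize_tagged(tagged_tokens, wordnet_lemmatizer) once per file.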
