How to provide (or generate) tags for nltk lemmatizers


Problem description

I have a set of documents, and I would like to transform them into a form that lets me compute tf-idf for the words in those documents (so that each document is represented by a vector of tf-idf numbers).

I thought it would be enough to call WordNetLemmatizer.lemmatize(word) and then a PorterStemmer, but 'have', 'has', 'had', etc. are not all transformed to 'have' by the lemmatizer, and the same goes for other words. Then I read that I am supposed to give the lemmatizer a hint: a tag representing the type of the word, whether it is a noun, verb, adjective, etc.
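To illustrate, a minimal sketch of the behaviour: without a POS tag, WordNetLemmatizer assumes the word is a noun, so verb forms like 'has' and 'had' are not collapsed to 'have':

from nltk.stem.wordnet import WordNetLemmatizer

lmtzr = WordNetLemmatizer()

# with no POS hint the lemmatizer assumes a noun, so verb forms are not reduced
print(lmtzr.lemmatize('has'))        # not 'have'
print(lmtzr.lemmatize('has', 'v'))   # 'have'
print(lmtzr.lemmatize('had', 'v'))   # 'have'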

My question is: how do I get these tags? What am I supposed to execute on those documents to get them?

I am using Python 3.4, and I am lemmatizing + stemming a single word at a time. I tried WordNetLemmatizer and EnglishStemmer from nltk, and also stem() from stemming.porter2.

Recommended answer

OK, I googled some more and found out how to get these tags. First one has to do some preprocessing to be sure the file will get tokenized properly (in my case it was about removing some leftovers from the pdf-to-txt conversion).

Then the file has to be tokenized into sentences, each sentence into a word array, and that can be tagged by the nltk tagger. With that, lemmatization can be done, and then stemming added on top of it.

from nltk.tokenize import sent_tokenize, word_tokenize
# use sent_tokenize to split text into sentences, and word_tokenize
# to split sentences into words
from nltk.tag import pos_tag
# use this to generate an array of (word, tag) tuples;
# the Treebank tag can then be translated into a wordnet tag
# as in the response this snippet borrows from (get_wordnet_pos below)
from nltk.corpus import wordnet
# provides the wordnet.ADJ / wordnet.VERB / ... constants used below
from nltk.stem.wordnet import WordNetLemmatizer
from stemming.porter2 import stem

# code from response mentioned above
def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return ''    


with open(myInput, 'r') as f:
    data = f.read()
    sentences = sent_tokenize(data)
    ignoreTypes = ['TO', 'CD', '.', 'LS', '']  # my choice
    lmtzr = WordNetLemmatizer()
    for sent in sentences:
        words = word_tokenize(sent)
        tags = pos_tag(words)
        for (word, pos) in tags:
            if pos in ignoreTypes:
                continue
            tag = get_wordnet_pos(pos)
            if tag == '':
                continue
            lemma = lmtzr.lemmatize(word, tag)
            stemW = stem(lemma)

At this point I get the stemmed word stemW, which I can then write to a file and use to count tf-idf vectors per document.
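As a final, illustrative sketch (not part of the pipeline above): one way to get the per-document tf-idf vectors is scikit-learn's TfidfVectorizer, assuming the stemmed words of each document have been joined back into one whitespace-separated string; processed_docs below is a hypothetical placeholder for that list of strings.

# Hypothetical tf-idf step, assuming processed_docs holds one
# whitespace-joined string of stemmed words per document.
from sklearn.feature_extraction.text import TfidfVectorizer

processed_docs = [
    "stem word from document one",
    "stem word from document two",
]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(processed_docs)
print(tfidf_matrix.shape)  # (number of documents, vocabulary size)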
