NLTK Lemmatizer, Extract meaningful words

Problem description

Currently, I am going to create machine-learning-based code that automatically maps categories.

Before that, I am going to do some natural language processing.

There is a list of words:

      sent = ('The laughs you two heard were triggered '
              'by memories of his own high j-flying '
              'moist moisture moisturize moisturizing').lower().split()

I wrote the following code, referencing this url: NLTK: lemmatizer and pos_tag

from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
def lemmatize_all(sentence):
    wnl = WordNetLemmatizer()
    for word, tag in pos_tag(word_tokenize(sentence)):
        if tag.startswith("NN"):
            yield wnl.lemmatize(word, pos='n')
        elif tag.startswith('VB'):
            yield wnl.lemmatize(word, pos='v')
        elif tag.startswith('JJ'):
            yield wnl.lemmatize(word, pos='a')



words = ' '.join(lemmatize_all(' '.join(sent)))

The resulting values are shown below.

laugh heard be trigger memory own high j-flying moist moisture moisturize moisturizing

I am satisfied with the following results.

laughs -> laugh 
were -> be
triggered -> trigger 
memories -> memory 
moist -> moist 

However, I'm not satisfied with the following values.

heard -> heard 
j-flying -> j-flying 
moisture -> moisture 
moisturize -> moisturize 
moisturizing -> moisturizing 

Although this is better than the initial values, I would like the following results.

heard -> hear
j-flying -> fly
moisture -> moist
moisturize -> moist
moisturizing -> moist

If you have any other good way to extract meaningful words, please let me know. Thank you.

Recommended answer

TL;DR

It's an XY problem: the lemmatizer fails to meet your expectation because the lemmatizer you're using was built to solve a different problem.

Q: What is a lemma?

Lemmatisation (or lemmatization) in linguistics is the process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word's lemma, or dictionary form. - Wikipedia

Q: What is the "dictionary form"?

NLTK is using the morphy algorithm, which uses WordNet as the basis of "dictionary forms".

See also How does spacy lemmatizer works?. Note that SpaCy has additional hacks put in to handle more irregular words.

Q: Why moisture -> moisture and moisturizing -> moisturizing?

Because there are synsets (sort of "dictionary forms") for "moisture" and "moisturizing":

>>> from nltk.corpus import wordnet as wn

>>> wn.synsets('moisture')
[Synset('moisture.n.01')]
>>> wn.synsets('moisture')[0].definition()
'wetness caused by water'

>>> wn.synsets('moisturizing')
[Synset('humidify.v.01')]
>>> wn.synsets('moisturizing')[0].definition()
'make (more) humid'

Q: How could I get moisture -> moist?

It's not really useful. But you could try a stemmer (don't expect too much of it, though):

>>> from nltk.stem import PorterStemmer

>>> porter = PorterStemmer()
>>> porter.stem("moisture")
'moistur'

>>> porter.stem("moisturizing")
'moistur'

Q: Then how do I get moisturizing/moisture -> moist?!

There's no well-founded way to do that. But before even trying, ask what the eventual purpose of doing moisturizing/moisture -> moist is.

Is it really necessary to do that?

If you really want, you can try word vectors and look for the most similar words, but there's a whole other world of caveats that comes with word vectors.
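As an illustration only (the tiny vectors below are made up; real ones come from a model such as word2vec or GloVe and have hundreds of dimensions), "most similar" boils down to cosine similarity over a vocabulary:

```python
# Toy sketch of word-vector similarity. The embeddings are hypothetical,
# hand-picked so that the "moist" family clusters together.
import math

vectors = {
    'moist':      [1.00, 0.00, 0.05],
    'moisture':   [0.95, 0.05, 0.00],
    'moisturize': [0.80, 0.20, 0.00],
    'laugh':      [0.00, 0.05, 1.00],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(word):
    """The vocabulary entry closest to `word`, excluding `word` itself."""
    return max((w for w in vectors if w != word),
               key=lambda w: cosine(vectors[word], vectors[w]))

print(most_similar('moisture'))  # 'moist'
```

With real embeddings the neighbours are whatever co-occurs in the training corpus, which may or may not be the morphological relatives you are after; that is one of the caveats.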

Q: Wait a minute, but heard -> heard is ridiculous!

Yeah, the POS tagger isn't tagging heard correctly. Most probably because the input is not a proper sentence, the POS tags for the words in it are wrong:

>>> from nltk import word_tokenize, pos_tag
>>> sent
'The laughs you two heard were triggered by memories of his own high j-flying moist moisture moisturize moisturizing.'

>>> pos_tag(word_tokenize(sent))
[('The', 'DT'), ('laughs', 'NNS'), ('you', 'PRP'), ('two', 'CD'), ('heard', 'NNS'), ('were', 'VBD'), ('triggered', 'VBN'), ('by', 'IN'), ('memories', 'NNS'), ('of', 'IN'), ('his', 'PRP$'), ('own', 'JJ'), ('high', 'JJ'), ('j-flying', 'NN'), ('moist', 'NN'), ('moisture', 'NN'), ('moisturize', 'VB'), ('moisturizing', 'NN'), ('.', '.')]

We see that heard is tagged as NNS (a noun). If we lemmatize it as a verb:

>>> from nltk.stem import WordNetLemmatizer
>>> wnl = WordNetLemmatizer()
>>> wnl.lemmatize('heard', pos='v')
'hear'

Q: Then how do I get the correct POS tags?

Probably with SpaCy, where you get ('heard', 'VERB'):

>>> import spacy
>>> nlp = spacy.load('en_core_web_sm')
>>> sent
'The laughs you two heard were triggered by memories of his own high j-flying moist moisture moisturize moisturizing.'
>>> doc = nlp(sent)
>>> [(word.text, word.pos_) for word in doc]
[('The', 'DET'), ('laughs', 'VERB'), ('you', 'PRON'), ('two', 'NUM'), ('heard', 'VERB'), ('were', 'VERB'), ('triggered', 'VERB'), ('by', 'ADP'), ('memories', 'NOUN'), ('of', 'ADP'), ('his', 'ADJ'), ('own', 'ADJ'), ('high', 'ADJ'), ('j', 'NOUN'), ('-', 'PUNCT'), ('flying', 'VERB'), ('moist', 'NOUN'), ('moisture', 'NOUN'), ('moisturize', 'NOUN'), ('moisturizing', 'NOUN'), ('.', 'PUNCT')]

But note that in this case SpaCy got ('moisturize', 'NOUN') while NLTK got ('moisturize', 'VB').

Q: But can't I get moisturize -> moist with SpaCy?

Let's not go back to the start, where we defined what a lemma is. In short:

>>> import spacy
>>> nlp = spacy.load('en_core_web_sm')
>>> sent
'The laughs you two heard were triggered by memories of his own high j-flying moist moisture moisturize moisturizing.'
>>> doc = nlp(sent)
>>> [word.lemma_ for word in doc]
['the', 'laugh', '-PRON-', 'two', 'hear', 'be', 'trigger', 'by', 'memory', 'of', '-PRON-', 'own', 'high', 'j', '-', 'fly', 'moist', 'moisture', 'moisturize', 'moisturizing', '.']

See also How does spacy lemmatizer works? (again).

Q: Okay, fine. I can't get moisturize -> moist... and the POS tag is not perfect for heard -> hear. But why can't I get j-flying -> fly?

Back to the question of why you need to convert j-flying -> fly: there are counter-examples of why you wouldn't want to separate something that looks like a compound.

For example:

  • Should Classical-sounding go to sound?
  • Should X-fitting go to fit?
  • Should crash-landing go to landing?

Depending on the ultimate purpose of your application, converting a token to your desired form may or may not be necessary.

Q: So what is a good way to extract meaningful words?

I sound like a broken record, but it depends on your ultimate goal.

If your goal is really to understand the meaning of words, then you have to ask yourself the question, "What is the meaning of meaning?"

Does an individual word have a meaning outside of its context? Or does it have the sum of the meanings from all the possible contexts it could occur in?

Au courant, the state of the art basically treats all meanings as arrays of floats, and comparisons between those arrays are what give meaning its meaning. But is that really meaning, or just a means to an end? (Pun intended.)

Q: Why do I have more questions than answers?

Welcome to the world of computational linguistics, which has its roots in philosophy (like computer science). Natural language processing is commonly known as the application of computational linguistics.

Q: Is a lemmatizer better than a stemmer?

A: No definite answer. (c.f. Stemmers vs Lemmatizers)
