Lemmatizing words after POS tagging produces unexpected results


Question


I am using python3.5 with the nltk pos_tag function and the WordNetLemmatizer. My goal is to flatten words in our database to classify text. I am trying to test using the lemmatizer and I encounter strange behavior when using the POS tagger on identical tokens. In the example below, I have a list of three identical strings, and when running them through the POS tagger every other element is returned as a noun (NN) and the rest are returned as verbs (VBG).


This affects the lemmatization. The output looks like this:

pos Of token: v
lemmatized token: skydive
pos Of token: n
lemmatized token: skydiving
pos Of token: v
lemmatized token: skydive


If I add more elements to the list of identical strings, this same pattern continues. The full code I am using is this:

from nltk import pos_tag
from nltk.stem import WordNetLemmatizer

tokens = ['skydiving', 'skydiving', 'skydiving']
lmtzr = WordNetLemmatizer()

def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return 'a'
    elif treebank_tag.startswith('V'):
        return 'v'
    elif treebank_tag.startswith('N'):
        return 'n'
    elif treebank_tag.startswith('R'):
        return 'r'
    elif treebank_tag.startswith('S'):
        return ''
    else:
        return ''

numTokens = (len(tokens))
for i in range(0,numTokens):
    tokens[i]=tokens[i].replace(" ","")

noSpaceTokens = pos_tag(tokens)

for token in noSpaceTokens:
    tokenStr = str(token[1])
    noWhiteSpace = token[0].replace(" ", "")
    preLemmed = get_wordnet_pos(tokenStr)
    print("pos Of token: " + preLemmed)
    lemmed = lmtzr.lemmatize(noWhiteSpace,preLemmed)
    print("lemmatized token: " + lemmed)

Answer

In short:


When POS tagging, you need a context sentence, not a list of ungrammatical tokens.

  • E.g. to lemmatize a grammatical sentence, you can take a look at https://github.com/alvations/pywsd/blob/master/pywsd/utils.py#L100
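A minimal sketch of such a sentence-level lemmatizer (a simplified rendition of the idea behind the linked pywsd helper, not its actual code; the `penn2morphy` name is illustrative) might look like:

```python
from nltk import pos_tag, word_tokenize
from nltk.stem import WordNetLemmatizer

def penn2morphy(penntag, default='n'):
    # Map the first letter of a Penn Treebank tag to a WordNet POS;
    # fall back to noun for tags WordNet does not distinguish.
    morphy = {'J': 'a', 'V': 'v', 'N': 'n', 'R': 'r'}
    return morphy.get(penntag[0], default)

def lemmatize_sentence(sentence):
    # Tag the full sentence first so every token gets a contextual POS,
    # then lemmatize each token with that POS.
    wnl = WordNetLemmatizer()
    return [wnl.lemmatize(word.lower(), pos=penn2morphy(tag))
            for word, tag in pos_tag(word_tokenize(sentence))]
```

Tagging before lemmatizing is the key ordering here: the tagger needs the whole sentence as context, and the lemmatizer needs the resulting tag per word.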


When lemmatizing out of sentence context, the only way to get the right lemma is to manually specify the POS tag.

  • E.g. from https://github.com/alvations/pywsd/blob/master/pywsd/utils.py#L66 , we have to specify the pos parameter for the lemmatize function.
  • Otherwise it would always assume the n POS; see also WordNetLemmatizer not returning the right lemma unless POS is explicit - Python NLTK

In long:


A POS tagger usually works on full sentences, not on individual words. When you try to tag a single word out of context, what you get is the most frequent tag.


To verify that when tagging a single word (i.e. a sentence with only 1 word), it always gives the same tag:

>>> from nltk.stem import WordNetLemmatizer
>>> from nltk import pos_tag
>>> ptb2wn_pos = {'J':'a', 'V':'v', 'N':'n', 'R':'r'}
>>> sent = ['skydive']
>>> most_frequent_tag = pos_tag(sent)[0][1]
>>> most_frequent_tag
'JJ'
>>> most_frequent_tag = ptb2wn_pos[most_frequent_tag[0]]
>>> most_frequent_tag
'a'
>>> for _ in range(1000): assert ptb2wn_pos[pos_tag(sent)[0][1][0]] == most_frequent_tag;
... 
>>>


Now, since the tag is always 'a' by default if the sentence has only 1 word, the WordNetLemmatizer will always return skydive:

>>> wnl = WordNetLemmatizer()
>>> wnl.lemmatize(sent[0], pos=most_frequent_tag)
'skydive'


Let's try to see the lemma of a word in the context of a sentence:

>>> sent2 = 'They skydrive from the tower yesterday'
>>> pos_tag(sent2.split())
[('They', 'PRP'), ('skydrive', 'VBP'), ('from', 'IN'), ('the', 'DT'), ('tower', 'NN'), ('yesterday', 'NN')]
>>> pos_tag(sent2.split())[1]
('skydrive', 'VBP')
>>> pos_tag(sent2.split())[1][1]
'VBP'
>>> ptb2wn_pos[pos_tag(sent2.split())[1][1][0]]
'v'


So the context of the input list of tokens matters when you do pos_tag.


In your example, you had a list ['skydiving', 'skydiving', 'skydiving'] meaning the sentence that you are pos-tagging is an ungrammatical sentence:


skydiving skydiving skydiving


And the pos_tag function thinks it is a normal sentence, hence giving the tags:

>>> sent3 = 'skydiving skydiving skydiving'.split()
>>> pos_tag(sent3)
[('skydiving', 'VBG'), ('skydiving', 'NN'), ('skydiving', 'VBG')]


In which case the first word is tagged as a verb, the second as a noun, and the third as a verb, which returns the following lemmas (which you do not desire):

>>> wnl.lemmatize('skydiving', 'v')
'skydive'
>>> wnl.lemmatize('skydiving', 'n')
'skydiving'
>>> wnl.lemmatize('skydiving', 'v')
'skydive'


So if we have a valid grammatical sentence in your list of tokens, the output might look very different:

>>> sent3 = 'The skydiving sport is an exercise that promotes diving from the sky , ergo when you are skydiving , you feel like you are descending to earth .'
>>> pos_tag(sent3.split())
[('The', 'DT'), ('skydiving', 'NN'), ('sport', 'NN'), ('is', 'VBZ'), ('an', 'DT'), ('exercise', 'NN'), ('that', 'IN'), ('promotes', 'NNS'), ('diving', 'VBG'), ('from', 'IN'), ('the', 'DT'), ('sky', 'NN'), (',', ','), ('ergo', 'RB'), ('when', 'WRB'), ('you', 'PRP'), ('are', 'VBP'), ('skydiving', 'VBG'), (',', ','), ('you', 'PRP'), ('feel', 'VBP'), ('like', 'IN'), ('you', 'PRP'), ('are', 'VBP'), ('descending', 'VBG'), ('to', 'TO'), ('earth', 'JJ'), ('.', '.')]

