NLTK Thinks that Imperatives are Nouns

Question

I'm using the pos_tagger on recipes. A problem I'm having is that the pos_tagger tags words in the imperative mood as nouns. Shouldn't they be verbs? For example:

With the input:

combine 1 1/2 cups floud, 3/4 cup sugar, salt and baking powder

The output is:

[('combine', 'NN'), ('1', 'CD'), ('1/2', 'CD'), ('cups', 'NNS'), ('floud', 'VBD'), (',', ','), ('3/4', 'CD'), ('cup', 'NN'), ('sugar', 'NN'), (',', ','), ('salt', 'NN'), ('and', 'CC'), ('baking', 'VBG'), ('powder', 'NN')]

Here's the code I'm using for this:

    import nltk

    def part_of_speech(self, input_sentence):
        # Split the sentence into tokens, then tag each token with its part of speech.
        text = nltk.word_tokenize(input_sentence)
        return nltk.pos_tag(text)

Shouldn't 'combine' be tagged as some sort of verb? Is this the fault of NLTK, or am I doing something wrong?

Recommended answer

What you're seeing is a very common problem in traditional statistical natural language processing (NLP). In short, the data you are using the tagger on doesn't look like the data it was trained on. NLTK doesn't document the details, but as far as I know the default tagger is trained on Wall Street Journal articles, the Brown Corpus, or some combination of the two. These corpora contain very few imperatives, so when you give the tagger imperative sentences it tends to mistag them.

A good long-term solution would be to correct the tags for a large corpus of recipes and train on the corrected data; that way you remove the mismatch between the training and testing data. This is, however, a huge amount of work. Ideally, a corpus with a lot of imperatives would already exist. My research group has looked into this and we have not found a suitable one, although we are in the process of producing one.
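
As a rough illustration of what "training on corrected data" means, here is a minimal sketch of a most-frequent-tag baseline learned from hand-corrected sentences. This is not NLTK's default tagger. The two corrected sentences, the function names `train_baseline` and `tag_with_baseline`, and the `NN` fallback for unseen words are all made up for illustration, and a real corpus would need far more recipes than this.

```python
from collections import Counter, defaultdict

# Hypothetical hand-corrected data: imperatives are tagged VB.
corrected_sentences = [
    [("combine", "VB"), ("flour", "NN"), ("and", "CC"), ("sugar", "NN")],
    [("stir", "VB"), ("the", "DT"), ("batter", "NN")],
]


def train_baseline(tagged_sentences):
    """Learn the most frequent tag for each word in the corrected data."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag_with_baseline(model, words, default_tag="NN"):
    """Tag each word with its learned tag; unseen words get default_tag."""
    return [(word, model.get(word.lower(), default_tag)) for word in words]


model = train_baseline(corrected_sentences)
tagged = tag_with_baseline(model, ["combine", "sugar", "and", "salt"])
```

Because the model only knows what the corrected corpus contains, "combine" comes out as a verb here simply because the corrected data says so, which is the point of fixing the training data rather than the tagger.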

A much simpler solution, which I've been using on a recent project that required imperatives to be understood correctly, is to note which imperatives you care about and force the tags for those words to be correct.

So in the example below, I made a dictionary saying that "combine" should be treated as a verb, and then used a list comprehension to change the tags.

    tagged_words = [('combine', 'NN'), ('1', 'CD'), ('1/2', 'CD'), ('cups', 'NNS'), ('flour', 'VBD')]
    force_tags = {'combine': 'VB'}
    new_tagged_words = [(word, force_tags.get(word, tag)) for word, tag in tagged_words]

new_tagged_words now contains the original tags, except wherever the word has an entry in force_tags.

    >>> new_tagged_words
    [('combine', 'VB'), ('1', 'CD'), ('1/2', 'CD'), ('cups', 'NNS'), ('flour', 'VBD')]

This solution does require you to list the words you want to force to verbs. This is far from ideal, but there isn't a better general solution.
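
If the override is needed in more than one place, it could be wrapped in a small helper. The sketch below is one possible way to do that. The function name `force_tags_on` is hypothetical, and lower-casing each word before the lookup (so that a capitalized "Combine" at the start of a sentence is also caught) is an assumption added here, not part of the approach described above.

```python
def force_tags_on(tagged_words, force_tags):
    """Return a new list with tags replaced wherever force_tags has an entry.

    The lookup is case-insensitive; the original word casing is preserved.
    """
    return [
        (word, force_tags.get(word.lower(), tag))
        for word, tag in tagged_words
    ]


# Illustrative list of imperatives to force to the base verb tag.
force_tags = {"combine": "VB", "stir": "VB"}
tagged_words = [("Combine", "NN"), ("1", "CD"), ("cup", "NN"), ("sugar", "NN")]
new_tagged_words = force_tags_on(tagged_words, force_tags)
```

The helper builds a new list rather than modifying `tagged_words`, so the tagger's original output stays available for comparison.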
