nltk tokenization and contractions


Question

I'm tokenizing text with nltk, just sentences fed to wordpunct_tokenizer. This splits contractions (e.g. "don't" into "don" + "'" + "t"), but I want to keep them as one word. I'm refining my methods for a more measured and precise tokenization of text, so I need to delve deeper into the nltk tokenization module beyond simple tokenization.
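For illustration, a minimal sketch of the splitting described above, together with one common workaround: an NLTK RegexpTokenizer whose pattern treats an internal apostrophe as part of the word. The pattern here is an illustrative assumption, not an NLTK default.

```python
from nltk.tokenize import RegexpTokenizer, wordpunct_tokenize

sentence = "I don't think we're ready."

# wordpunct_tokenize splits at every punctuation boundary,
# so contractions break apart at the apostrophe:
print(wordpunct_tokenize(sentence))
# ['I', 'don', "'", 't', 'think', 'we', "'", 're', 'ready', '.']

# One workaround: match word-apostrophe-word before plain words,
# so contractions survive as single tokens (illustrative pattern):
tokenizer = RegexpTokenizer(r"\w+'\w+|\w+|[^\w\s]")
print(tokenizer.tokenize(sentence))
# ['I', "don't", 'think', "we're", 'ready', '.']
```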

I'm guessing this is common, and I'd like feedback from others who may have had to deal with this particular issue before.

Yeah, this is a general, scattershot question, I know.

Also, as a novice to NLP, do I need to worry about contractions at all?

The SExprTokenizer or TreebankWordTokenizer seems to do what I'm looking for, for now.
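For reference, a quick sketch of how TreebankWordTokenizer actually treats contractions: it still splits them, but along the Penn Treebank's linguistic boundaries ("do" + "n't") rather than at the raw apostrophe.

```python
from nltk.tokenize import TreebankWordTokenizer

# Penn Treebank conventions: contractions split into meaningful
# units ("do" + "n't", "we" + "'re"), not at the apostrophe char.
tokenizer = TreebankWordTokenizer()
print(tokenizer.tokenize("I don't think we're ready."))
# ['I', 'do', "n't", 'think', 'we', "'re", 'ready', '.']
```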

Answer

Which tokenizer you use really depends on what you want to do next. As inspectorG4dget said, some part-of-speech taggers handle split contractions, and in that case the splitting is a good thing. But maybe that's not what you want. To decide which tokenizer is best, consider what you need for the next step, and then submit your text to http://text-processing.com/demo/tokenize/ to see how each NLTK tokenizer behaves.
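If you'd rather compare locally than through the demo page, a small sketch along the same lines (note that word_tokenize needs the punkt model, available via nltk.download('punkt')):

```python
from nltk.tokenize import (
    TreebankWordTokenizer,
    WhitespaceTokenizer,
    word_tokenize,      # requires the 'punkt' model: nltk.download('punkt')
    wordpunct_tokenize,
)

sentence = "I don't think we're ready."

# Run the same sentence through several NLTK tokenizers and
# compare how each one handles the contractions.
candidates = {
    "word_tokenize": word_tokenize,
    "wordpunct_tokenize": wordpunct_tokenize,
    "TreebankWordTokenizer": TreebankWordTokenizer().tokenize,
    "WhitespaceTokenizer": WhitespaceTokenizer().tokenize,
}

for name, tokenize in candidates.items():
    print(f"{name:>22}: {tokenize(sentence)}")
```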
