nltk tokenization and contractions


Question

I'm tokenizing text with nltk, just sentences fed to wordpunct_tokenizer. This splits contractions (e.g. "don't" into "don" + "'" + "t"), but I want to keep them as one word. I'm refining my methods for a more measured and precise tokenization of text, so I need to delve deeper into the nltk tokenization module beyond simple tokenization.
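For illustration, a minimal sketch of the splitting described above, together with one common workaround: an NLTK RegexpTokenizer whose pattern treats an internal apostrophe as part of the word. The pattern here is an illustrative assumption, not an NLTK default.

```python
from nltk.tokenize import RegexpTokenizer, wordpunct_tokenize

sentence = "I don't think we're ready."

# wordpunct_tokenize splits at every punctuation boundary,
# so contractions break apart at the apostrophe:
print(wordpunct_tokenize(sentence))
# ['I', 'don', "'", 't', 'think', 'we', "'", 're', 'ready', '.']

# One workaround: match word-apostrophe-word before plain words,
# so contractions survive as single tokens (illustrative pattern):
tokenizer = RegexpTokenizer(r"\w+'\w+|\w+|[^\w\s]")
print(tokenizer.tokenize(sentence))
# ['I', "don't", 'think', "we're", 'ready', '.']
```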

I'm guessing this is common, and I'd like feedback from others who may have had to deal with this particular issue before.

Yeah, this is a general, scattershot question, I know.

Also, as a novice to NLP, do I need to worry about contractions at all?

The SExprTokenizer or TreebankWordTokenizer seems to do what I'm looking for, for now.
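For reference, a quick sketch of how TreebankWordTokenizer actually treats contractions: it still splits them, but along the Penn Treebank's linguistic boundaries ("do" + "n't") rather than at the raw apostrophe.

```python
from nltk.tokenize import TreebankWordTokenizer

# Penn Treebank conventions: contractions split into meaningful
# units ("do" + "n't", "we" + "'re"), not at the apostrophe char.
tokenizer = TreebankWordTokenizer()
print(tokenizer.tokenize("I don't think we're ready."))
# ['I', 'do', "n't", 'think', 'we', "'re", 'ready', '.']
```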

Answer

Which tokenizer you use really depends on what you want to do next. As inspectorG4dget said, some part-of-speech taggers handle split contractions, and in that case the splitting is a good thing. But maybe that's not what you want. To decide which tokenizer is best, consider what you need for the next step, and then submit your text to http://text-processing.com/demo/tokenize/ to see how each NLTK tokenizer behaves.
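If you'd rather compare locally than through the demo page, a small sketch along the same lines (note that word_tokenize needs the punkt model, available via nltk.download('punkt')):

```python
from nltk.tokenize import (
    TreebankWordTokenizer,
    WhitespaceTokenizer,
    word_tokenize,      # requires the 'punkt' model: nltk.download('punkt')
    wordpunct_tokenize,
)

sentence = "I don't think we're ready."

# Run the same sentence through several NLTK tokenizers and
# compare how each one handles the contractions.
candidates = {
    "word_tokenize": word_tokenize,
    "wordpunct_tokenize": wordpunct_tokenize,
    "TreebankWordTokenizer": TreebankWordTokenizer().tokenize,
    "WhitespaceTokenizer": WhitespaceTokenizer().tokenize,
}

for name, tokenize in candidates.items():
    print(f"{name:>22}: {tokenize(sentence)}")
```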
