NLTK word tokenizer treats ending single quote as a separate word

Problem description

Here's a code snippet from an IPython notebook:

from nltk.tokenize import word_tokenize

test = "'v'"
words = word_tokenize(test)
words

The output is:

["'v", "'"]

As you can see, the ending single quote is treated as a separate word, while the opening one stays attached to "v". I want to get either

["'v'"]

or

["'", "v", "'"]

Is there any way to achieve this?

Recommended answer

Try MosesTokenizer from nltk.tokenize.moses:

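# Note: nltk.tokenize.moses was removed from later NLTK releases;
# the sacremoses package provides MosesTokenizer and MosesDetokenizer.
from nltk.tokenize.moses import MosesTokenizer, MosesDetokenizer
t, d = MosesTokenizer(), MosesDetokenizer()
tokens = t.tokenize(test)
tokens
['&apos;v&apos;']

where &apos; = ' (the XML-escaped form of the single quote).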
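If tokens have already been produced with escaping on, the entities can be converted back afterwards. The following is a minimal sketch using only the Python standard library; the variable names are illustrative, and the input list is hard-coded to the escaped form the Moses tokenizer emits rather than computed by it.

```python
import html

# Tokens in the XML-escaped form the Moses tokenizer emits by default
# (hard-coded here for illustration).
escaped_tokens = ["&apos;v&apos;"]

# html.unescape converts entities such as &apos; back into characters.
tokens = [html.unescape(tok) for tok in escaped_tokens]
print(tokens)
```

This only undoes the escaping; it does not change how the text was split into tokens.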

You can also pass the escape=False argument to prevent the escaping of XML special characters:

>>> t.tokenize("'v'", escape=False)
["'v'"]


Keeping 'v' as a single token is consistent with the output of the original Moses tokenizer, i.e.

~/mosesdecoder/scripts/tokenizer$ perl tokenizer.perl -l en < x
Tokenizer Version 1.1
Language: en
Number of threads: 1
&apos;v&apos;

There are other tokenizers to explore if you also want single quotes handled.
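For the other output asked for in the question, ["'", "v", "'"], one possible approach is a plain regular-expression tokenizer that emits every punctuation character as its own token. The sketch below uses only the standard library; the names TOKEN_PATTERN and split_quotes are hypothetical and not part of NLTK.

```python
import re

# \w+ matches a run of word characters; [^\w\s] matches a single
# character that is neither a word character nor whitespace.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def split_quotes(text):
    """Return word runs and individual punctuation characters as tokens."""
    return TOKEN_PATTERN.findall(text)


print(split_quotes("'v'"))
```

The trade-off is that apostrophes inside words are split off too, so a contraction such as "it's" becomes three tokens.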
