Nltk word tokenizer treats ending single quote as a separate word
Problem description
Here's a code snippet from the IPython notebook:
from nltk.tokenize import word_tokenize
test = "'v'"
words = word_tokenize(test)
words
The output is:
["'v", "'"]
As you can see, the ending single quote is treated as a separate word, while the first one is part of "v". I want to have
["'v'"]
or
["'", "v", "'"]
Is there any way to achieve this?
Recommended answer
Try the Moses tokenizer:
from nltk.tokenize.moses import MosesTokenizer, MosesDetokenizer
t, d = MosesTokenizer(), MosesDetokenizer()
tokens = t.tokenize(test)
tokens
['&apos;v&apos;']
where &apos; = '
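If the escaped form is not wanted downstream, the entities can be converted back after tokenizing. A minimal sketch using only the Python standard library; the token list is illustrative data, not output captured from a real run:

```python
import html

# Illustrative tokens, as an XML-escaping tokenizer might return them.
escaped_tokens = ["&apos;v&apos;"]

# html.unescape turns entities such as &apos; back into the characters they stand for.
plain_tokens = [html.unescape(tok) for tok in escaped_tokens]
print(plain_tokens)
```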
You can also pass the escape=False argument to prevent XML special characters from being escaped:
>>> t.tokenize("'v'", escape=False)
["'v'"]
Keeping 'v' as a single token is consistent with the output of the original Moses tokenizer, i.e.
~/mosesdecoder/scripts/tokenizer$ perl tokenizer.perl -l en < x
Tokenizer Version 1.1
Language: en
Number of threads: 1
'v'
There are other tokenizers that also handle single quotes, if you wish to explore them.
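For the second output the question asks for, where each quote is its own token, a plain regular expression may be enough. The sketch below uses only the standard library; the helper name split_words_and_punct is made up for illustration, and the pattern is the same one NLTK's wordpunct_tokenize is built on:

```python
import re

def split_words_and_punct(text):
    """Return runs of word characters and runs of punctuation as separate tokens."""
    # \w+ matches a word; [^\w\s]+ matches a run of non-word, non-space characters.
    return re.findall(r"\w+|[^\w\s]+", text)

print(split_words_and_punct("'v'"))
```

Note that this splits on every punctuation run, so contractions such as "don't" also come apart into three tokens.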