Python: Untokenize a sentence


Question

There are many guides on how to tokenize a sentence, but I couldn't find any on how to do the opposite.

import nltk
words = nltk.word_tokenize("I've found a medicine for my disease.")
# The result I get is: ['I', "'ve", 'found', 'a', 'medicine', 'for', 'my', 'disease', '.']

Is there any function that reverts the tokenized sentence to its original state? The function tokenize.untokenize() for some reason doesn't work.
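A likely reason tokenize.untokenize() does not help here: it belongs to Python's standard-library tokenize module, which works on Python source code rather than natural-language words, and it expects token tuples, not a list of plain strings. A minimal sketch of what it is meant for (the variable names are only illustrative):

```python
import io
import tokenize

# The standard-library tokenize module operates on Python source code.
source = "x = 1 + 2\n"

# generate_tokens() takes a readline callable and yields TokenInfo tuples.
tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))

# untokenize() rebuilds source text from those token tuples.
restored = tokenize.untokenize(tokens)
print(restored)
```

Because it expects these tuples, it is not a counterpart to nltk.word_tokenize().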

I know that I can do something like the following, which probably solves the problem, but I am curious whether there is a built-in function for this:

result = ' '.join(words).replace(' ,', ',').replace(' .', '.').replace(' !', '!')
result = result.replace(' ?', '?').replace(' : ', ': ').replace(" '", "'")

Recommended answer

You can use the Treebank detokenizer, TreebankWordDetokenizer:

from nltk.tokenize.treebank import TreebankWordDetokenizer
TreebankWordDetokenizer().detokenize(['the', 'quick', 'brown'])
# 'the quick brown'
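Applied to the tokens from the question, a minimal sketch could look like this. The token list is hard-coded from the question's output so that no tokenizer data needs to be downloaded; exact spacing rules may differ between nltk versions.

```python
from nltk.tokenize.treebank import TreebankWordDetokenizer

# Tokens as produced by nltk.word_tokenize() in the question.
tokens = ['I', "'ve", 'found', 'a', 'medicine', 'for', 'my', 'disease', '.']

detokenizer = TreebankWordDetokenizer()

# detokenize() joins the tokens and undoes the Treebank-style padding
# around contractions and punctuation.
sentence = detokenizer.detokenize(tokens)
print(sentence)
```

The detokenizer only reverses spacing conventions; it does not restore information the tokenizer discarded, such as the original whitespace or line breaks.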


There is also MosesDetokenizer, which used to be part of nltk but was removed because of licensing issues; it is now available in the standalone sacremoses package.
