Can NLTK/pyNLTK work "per language" (i.e. non-English), and how?


Question

How can I tell NLTK to treat text as being in a particular language?

Once in a while I write a specialized NLP routine to do POS tagging, tokenizing, etc. on a non-English (but still Indo-European) text domain.

This question seems to address only different corpora, not a change in code/settings: POS tagging in German

Alternatively, are there any specialized Hebrew/Spanish/Polish NLP modules for Python?

Recommended answer

I'm not sure what you mean by changes in code/settings. NLTK mostly relies on machine learning, and the "settings" are usually extracted from the training data.

When it comes to POS tagging, the results and the tag set depend on the tagger you use or train. If you train your own, you will of course need some Spanish/Polish training data. These can be hard to find because little gold-standard material is publicly available. There are tools out there that do this, but this one, TreeTagger, isn't for Python (http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/).
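To make the "train your own tagger" point concrete, here is a minimal sketch that uses NLTK's UnigramTagger with a DefaultTagger as backoff. The tiny Spanish training set and its tag names (DET, NOUN, VERB) are made up for illustration and are not from any real corpus; a usable tagger would need a proper gold-standard corpus in the target language.

```python
from nltk.tag import DefaultTagger, UnigramTagger

# Hypothetical hand-tagged Spanish sentences (illustration only).
# Each sentence is a list of (word, tag) pairs.
train_sents = [
    [("el", "DET"), ("perro", "NOUN"), ("come", "VERB")],
    [("la", "DET"), ("gata", "NOUN"), ("duerme", "VERB")],
    [("el", "DET"), ("gato", "NOUN"), ("come", "VERB")],
]

# Words never seen in training fall back to a default tag (an assumption
# made here for simplicity; NOUN is just a common guess).
backoff = DefaultTagger("NOUN")

# The tagger learns the most frequent tag for each word in the training data.
tagger = UnigramTagger(train_sents, backoff=backoff)

tagged = tagger.tag(["la", "perro", "duerme"])
print(tagged)
```

Nothing in the code is language-specific: the language lives entirely in the training data, which is why swapping in Polish or Hebrew sentences would give a Polish or Hebrew tagger of the same (very basic) kind.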

The nltk.tokenize.punkt.PunktSentenceTokenizer tokenizer splits text into sentences according to multilingual sentence boundaries; the details can be found in this paper (http://www.mitpressjournals.org/doi/abs/10.1162/coli.2006.32.4.485).
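As a rough illustration, PunktSentenceTokenizer can be trained without supervision on raw text in the target language by passing that text to its constructor. The Spanish snippet below is invented for the example and far too small to learn reliable abbreviation statistics, so treat this as a sketch of the API rather than a recipe for a good model.

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer

# Hypothetical raw Spanish text used as unsupervised training data.
# A real model would be trained on a much larger sample.
train_text = (
    "El perro come en el jardín. La gata duerme en la casa. "
    "Los niños juegan en el parque. El sol sale por la mañana. "
    "La lluvia cae sobre la ciudad. El tren llega a la estación."
)

# Passing text to the constructor trains the tokenizer's parameters on it.
tokenizer = PunktSentenceTokenizer(train_text)

text = "Hoy llueve mucho. Mañana saldrá el sol."
sentences = tokenizer.tokenize(text)

for sentence in sentences:
    print(sentence)
```

NLTK also ships pretrained Punkt models for several languages, which nltk.tokenize.sent_tokenize selects through its language parameter (for example language="spanish") once the Punkt resources have been downloaded.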
