can NLTK/pyNLTK work "per language" (i.e. non-english), and how?


Question

How can I tell NLTK to treat the text in a particular language?

Once in a while I write a specialized NLP routine to do POS tagging, tokenizing, etc. on a non-English (but still Indo-European) text domain.

This question seems to address only different corpora, not the change in code/settings: POS tagging in German

Alternatively, are there any specialized Hebrew/Spanish/Polish NLP modules for Python?

Answer

I'm not sure what you're referring to as the changes in code/settings. NLTK mostly relies on machine learning, and the "settings" are usually extracted from the training data.

When it comes to POS tagging, the results and tag set will depend on the tagger you use or train. If you train your own, you will of course need some Spanish/Polish training data. The reason such data can be hard to find is the lack of publicly available gold-standard material. There are tools out there that do this, but this one isn't for Python (http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/).
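If you do have tagged data in your target language, training a tagger in NLTK itself is straightforward. The following is a minimal sketch, not part of the original answer: it assumes the Spanish CESS-ESP treebank from NLTK's data collection, and the 90/10 split and the fallback tag are illustrative choices.

    import nltk
    from nltk.corpus import cess_esp
    from nltk.tag import DefaultTagger, UnigramTagger

    # Assumes the corpus was fetched once with: nltk.download('cess_esp')
    tagged_sents = list(cess_esp.tagged_sents())   # sentences as [(word, tag), ...]
    split = int(len(tagged_sents) * 0.9)
    train_sents, test_sents = tagged_sents[:split], tagged_sents[split:]

    # Fall back to an arbitrary noun tag for unseen words, then learn per-word tags.
    backoff = DefaultTagger('ncms000')             # illustrative EAGLES tag, pick your own
    tagger = UnigramTagger(train_sents, backoff=backoff)

    print(tagger.accuracy(test_sents))             # evaluate() on older NLTK releases
    print(tagger.tag('Los gatos duermen en la cama .'.split()))

The same pattern works with any tagged corpus reader, so a Polish or Hebrew treebank in a compatible format could be dropped in place of cess_esp.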

The nltk.tokenize.punkt.PunktSentenceTokenizer tokenizer will tokenize sentences according to multilingual sentence boundaries; the details can be found in this paper (http://www.mitpressjournals.org/doi/abs/10.1162/coli.2006.32.4.485).
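As a rough illustration (again, not from the original answer): NLTK ships pretrained Punkt models for a number of European languages, and the Punkt trainer only needs raw, unannotated text, so you can also build a model for a language that isn't covered. The German sample sentence and the my_corpus.txt path below are placeholders.

    import nltk
    from nltk.tokenize import sent_tokenize
    from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer

    text = "Herr Dr. Schmidt kam um 9 Uhr. Er blieb bis zum Abend."

    # Option 1: a pretrained model (fetch once with nltk.download('punkt'),
    # or 'punkt_tab' on recent NLTK releases).
    print(sent_tokenize(text, language='german'))

    # Option 2: train Punkt yourself on plain text in your language;
    # it learns abbreviations and sentence boundaries unsupervised.
    raw = open('my_corpus.txt', encoding='utf-8').read()   # placeholder corpus file
    trainer = PunktTrainer()
    trainer.train(raw, finalize=True)
    tokenizer = PunktSentenceTokenizer(trainer.get_params())
    print(tokenizer.tokenize(text))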
