Unicode Tagging in Python NLTK


Question

I am working on a Python NLTK tagging program. My input file is Hindi text containing several lines. When I tokenize the text and run pos_tag, every token in the output is tagged NN, but an English sentence as input is tagged properly. Version: Python 3.4.1, working from the NLTK 3.0 documentation.

Here is what I tried:

word_to_be_tagged = u"ताजो स्वास आनी चकचकीत दांत तुमचें व्यक्तीमत्व परजळायतात."

from nltk.corpus import indian

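# Split the tagged corpus: first 300 sentences for training, the rest for testing
train_data = indian.tagged_sents('hindi.pos')[:300]
test_data = indian.tagged_sents('hindi.pos')[300:]  # start at 300 so no sentence is skipped

print(word_to_be_tagged)
print(train_data)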

The output I get is:

ताजो स्वास आनी चकचकीत दांत तुमचें व्यक्तीमत्व परजळायतात.
[[('पूर्ण', 'JJ'), ('प्रतिबंध', 'NN'), ('हटाओ', 'VFM'), (':', 'SYM'), ('इराक', 'NNP')], [('संयुक्त', 'NNC'), ('राष्ट्र', 'NN'), ('।', 'SYM')], ...]

Recommended answer

The problem is that you should use a Hindi POS tagger:

import nltk
from nltk.corpus import indian
from nltk.tag import tnt

# Train the TnT part-of-speech tagger on the Hindi corpus
train_data = indian.tagged_sents('hindi.pos')
tnt_pos_tagger = tnt.TnT()
tnt_pos_tagger.train(train_data)

# word_to_be_tagged is the string defined in the question
print(tnt_pos_tagger.tag(nltk.word_tokenize(word_to_be_tagged)))

A part-of-speech tagger is accurate only within a specific domain (mostly a combination of language and topic). The default pos_tag model is trained on English, and in English most of the words a tagger has not seen before are nouns (NN), so it tags your Hindi data with NN only.
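The fallback behaviour described above can be illustrated with a toy lookup tagger. This is only a sketch in plain Python, not how NLTK's default tagger is implemented: `train_lookup_tagger`, `tag`, and `english_lexicon` are hypothetical names made up for the illustration, and the Hindi training pairs are copied from the `hindi.pos` sample printed in the question.

```python
from collections import Counter, defaultdict

def train_lookup_tagger(tagged_sents):
    """Map each word to the tag it carried most often in training."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag_label in sent:
            counts[word][tag_label] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag(words, lexicon, fallback="NN"):
    """Tag known words from the lexicon; unseen words get the fallback tag."""
    return [(w, lexicon.get(w, fallback)) for w in words]

# Two tagged sentences copied from the hindi.pos sample shown in the question.
hindi_train = [
    [("पूर्ण", "JJ"), ("प्रतिबंध", "NN"), ("हटाओ", "VFM"), (":", "SYM"), ("इराक", "NNP")],
    [("संयुक्त", "NNC"), ("राष्ट्र", "NN"), ("।", "SYM")],
]

# Hypothetical stand-in for an English-only model: it knows no Hindi words.
english_lexicon = {"the": "DT", "runs": "VBZ"}

hindi_lexicon = train_lookup_tagger(hindi_train)
words = ["पूर्ण", "प्रतिबंध", "हटाओ"]

english_result = tag(words, english_lexicon)  # every word is unseen
hindi_result = tag(words, hindi_lexicon)      # every word was seen in training
print(english_result)
print(hindi_result)
```

With the English-only lexicon every Hindi token falls through to NN, which matches the symptom in the question; the lexicon built from Hindi data returns the tags it saw in training.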

If you train it on the same domain you want it to tag afterwards (Hindi), it should be OK.
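One way to check whether a tagger really is "OK" on Hindi is to score it on sentences held out from training. The helper below is a minimal sketch in plain Python; `tagging_accuracy` and `always_nn` are hypothetical names, and the held-out sentence is taken from the `hindi.pos` sample printed in the question.

```python
def tagging_accuracy(tag_fn, gold_sents):
    """Fraction of tokens whose predicted tag matches the gold tag.

    tag_fn takes a list of words and returns a list of (word, tag) pairs.
    """
    correct = 0
    total = 0
    for sent in gold_sents:
        words = [word for word, _ in sent]
        predicted = tag_fn(words)
        for (_, gold_tag), (_, pred_tag) in zip(sent, predicted):
            if gold_tag == pred_tag:
                correct += 1
            total += 1
    return correct / total if total else 0.0

# Held-out sentence copied from the hindi.pos sample shown in the question.
held_out = [[("संयुक्त", "NNC"), ("राष्ट्र", "NN"), ("।", "SYM")]]

def always_nn(words):
    # Hypothetical baseline that mimics the all-NN output from the question.
    return [(w, "NN") for w in words]

baseline_accuracy = tagging_accuracy(always_nn, held_out)
print(baseline_accuracy)
```

Assuming the names used in the question and answer, the same helper could be called as `tagging_accuracy(tnt_pos_tagger.tag, test_data)`, provided the tagger was trained on `train_data` only so that the test sentences are unseen.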

See this for more explanations.
