Unicode Tagging in Python NLTK
Question
I am working on a Python NLTK tagging program. My input file is Hindi text containing several lines. When I tokenize the text and run pos_tag on it, every token in the output is tagged NN. With an English sentence as input, the tagging is correct. Version: Python 3.4.1, following the NLTK 3.0 documentation.
Kindly help! Here is what I tried.
word_to_be_tagged = u"ताजो स्वास आनी चकचकीत दांत तुमचें व्यक्तीमत्व परजळायतात."

from nltk.corpus import indian

# First 300 tagged sentences for training, the remainder for testing
train_data = indian.tagged_sents('hindi.pos')[:300]
test_data = indian.tagged_sents('hindi.pos')[300:]

print(word_to_be_tagged)
print(train_data)
This is the output I get:
ताजो स्वास आनी चकचकीत दांत तुमचें व्यक्तीमत्व परजळायतात.
[[('पूर्ण', 'JJ'), ('प्रतिबंध', 'NN'), ('हटाओ', 'VFM'), (':', 'SYM'), ('इराक', 'NNP')], [('संयुक्त', 'NNC'), ('राष्ट्र', 'NN'), ('।', 'SYM')], ...]
Recommended answer
The problem is that you should use a Hindi POS tagger:
import nltk
from nltk.corpus import indian
from nltk.tag import tnt

# Train the TnT part-of-speech tagger on the Hindi tagged corpus
train_data = indian.tagged_sents('hindi.pos')
tnt_pos_tagger = tnt.TnT()
tnt_pos_tagger.train(train_data)

# word_to_be_tagged is the string defined in the question
print(tnt_pos_tagger.tag(nltk.word_tokenize(word_to_be_tagged)))
The underlying issue is that a part-of-speech tagger is only accurate in the domain it was trained on (mostly a combination of language and topic). The default pos_tag model is trained on English, where most of the words a tagger has not seen before are nouns (NN), so it tags your Hindi data with NN only.
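To make the fallback behaviour concrete, here is a minimal sketch of a toy unigram tagger that assigns a default tag to any word it has not seen. This is not how NLTK's pos_tag is implemented. The names train_unigram, tag and DEFAULT_TAG are made up for illustration, and the one-sentence English training set is hypothetical. The Hindi sentences are the two shown in the question's output.

```python
from collections import Counter, defaultdict

DEFAULT_TAG = "NN"  # assumed fallback tag for words never seen in training


def train_unigram(tagged_sents):
    """Map each word to the tag it carried most often in the training data."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag_ in sent:
            counts[word][tag_] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag(model, tokens, default=DEFAULT_TAG):
    """Tag each token from the model, falling back to the default tag."""
    return [(tok, model.get(tok, default)) for tok in tokens]


# The two tagged sentences printed in the question
hindi_sents = [
    [("पूर्ण", "JJ"), ("प्रतिबंध", "NN"), ("हटाओ", "VFM"), (":", "SYM"), ("इराक", "NNP")],
    [("संयुक्त", "NNC"), ("राष्ट्र", "NN"), ("।", "SYM")],
]
# Hypothetical English training data, for contrast only
english_sents = [[("remove", "VB"), ("the", "DT"), ("ban", "NN")]]

# Take the first three words of the first Hindi sentence as input
hindi_tokens = [word for word, _ in hindi_sents[0][:3]]

english_model = train_unigram(english_sents)
hindi_model = train_unigram(hindi_sents)

# Every Hindi word is unknown to the English model, so each gets the default tag
print(tag(english_model, hindi_tokens))
# The Hindi model has seen these words and returns their trained tags
print(tag(hindi_model, hindi_tokens))
```

The English-trained model has no entry for any Hindi token, so the only thing it can return is the fallback tag, which matches the all-NN output described in the question.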
If you train it on the same domain you want it to tag (Hindi), it should be OK.
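One way to check whether a tagger fits the domain is to hold out some tagged sentences and measure per-token accuracy on them. The sketch below is plain Python and makes no NLTK calls. The helpers split_corpus and accuracy and the all_nn baseline are hypothetical names introduced here. The gold data is again the two sentences from the question's output.

```python
def split_corpus(tagged_sents, n_train):
    """Split into contiguous train/test parts so that no sentence is skipped."""
    return tagged_sents[:n_train], tagged_sents[n_train:]


def accuracy(tag_fn, gold_sents):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    correct = total = 0
    for sent in gold_sents:
        tokens = [word for word, _ in sent]
        predicted = tag_fn(tokens)
        for (_, gold_tag), (_, pred_tag) in zip(sent, predicted):
            correct += gold_tag == pred_tag
            total += 1
    return correct / total if total else 0.0


# The two tagged sentences printed in the question, used as gold data
gold = [
    [("पूर्ण", "JJ"), ("प्रतिबंध", "NN"), ("हटाओ", "VFM"), (":", "SYM"), ("इराक", "NNP")],
    [("संयुक्त", "NNC"), ("राष्ट्र", "NN"), ("।", "SYM")],
]

train, test = split_corpus(gold, 1)


def all_nn(tokens):
    """Baseline that mimics the symptom in the question: everything is NN."""
    return [(tok, "NN") for tok in tokens]


score = accuracy(all_nn, gold)
print(score)
```

Any callable that takes a token list and returns (word, tag) pairs can be passed as tag_fn, so a trained tagger's tag method could be scored the same way against held-out sentences.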
See this for more explanations.