Unknown symbol in nltk pos tagging for Arabic

Question

I have used NLTK to tokenize and tag some Arabic text.

However, I end up with results like

(u'an arabic character/word', '``') or (u'an arabic character/word', ':')

However, the documentation does not explain the `` or : tags.

Hence I would like to find out what they mean.

import nltk
# PunktWordTokenizer only exists in older NLTK releases.
from nltk.tokenize.punkt import PunktWordTokenizer

z = "أنا تسلق شجرة"
tkn = PunktWordTokenizer()
sent = tkn.tokenize(z)
tokens = nltk.pos_tag(sent)

print(tokens)

Recommended answer

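The default NLTK POS tagger is trained on English text and is meant for English text processing; see http://www.nltk.org/_modules/nltk/tag.html. The `` and : in your output are Penn Treebank tags (opening quotation mark and colon-like punctuation), presumably assigned because the English model does not recognize Arabic tokens. From the docs: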

An off-the-shelf tagger is available.  It uses the Penn Treebank tagset:

    >>> from nltk.tag import pos_tag  # doctest: +SKIP
    >>> from nltk.tokenize import word_tokenize # doctest: +SKIP
    >>> pos_tag(word_tokenize("John's big idea isn't all that bad.")) # doctest: +SKIP
    [('John', 'NNP'), ("'s", 'POS'), ('big', 'JJ'), ('idea', 'NN'), ('is',
    'VBZ'), ("n't", 'RB'), ('all', 'DT'), ('that', 'DT'), ('bad', 'JJ'),
    ('.', '.')]

And the code for pos_tag:

from nltk.data import load


# Standard treebank POS tagger
_POS_TAGGER = 'taggers/maxent_treebank_pos_tagger/english.pickle'
def pos_tag(tokens):
    """
    Use NLTK's currently recommended part of speech tagger to
    tag the given list of tokens.

        >>> from nltk.tag import pos_tag # doctest: +SKIP
        >>> from nltk.tokenize import word_tokenize # doctest: +SKIP
        >>> pos_tag(word_tokenize("John's big idea isn't all that bad.")) # doctest: +SKIP
        [('John', 'NNP'), ("'s", 'POS'), ('big', 'JJ'), ('idea', 'NN'), ('is',
        'VBZ'), ("n't", 'RB'), ('all', 'DT'), ('that', 'DT'), ('bad', 'JJ'),
        ('.', '.')]

    :param tokens: Sequence of tokens to be tagged
    :type tokens: list(str)
    :return: The tagged tokens
    :rtype: list(tuple(str, str))
    """
    tagger = load(_POS_TAGGER)
    return tagger.tag(tokens)
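
Because the default tagger emits Penn Treebank tags, the `` and : from the question can be looked up in that tagset. The small table below is a hand-written sketch covering only a few punctuation tags; the names PTB_PUNCT_TAGS and describe_tag are made up for this illustration and are not part of NLTK. If the NLTK "tagsets" data is installed, nltk.help.upenn_tagset should print the full descriptions instead.

```python
# A hand-written subset of Penn Treebank punctuation tags (illustrative only).
PTB_PUNCT_TAGS = {
    "``": "opening quotation mark",
    "''": "closing quotation mark",
    ":": "colon, semicolon, or ellipsis-like punctuation",
    ",": "comma",
    ".": "sentence terminator",
    "(": "opening parenthesis",
    ")": "closing parenthesis",
}


def describe_tag(tag):
    """Return a description for a tag, or a fallback if it is not in the table."""
    return PTB_PUNCT_TAGS.get(tag, "not a punctuation tag in this table")


# The two tags that showed up in the question.
for tag in ("``", ":"):
    print(repr(tag), "->", describe_tag(tag))
```

So the symbols are ordinary punctuation tags; they say more about the English model guessing on unfamiliar input than about the Arabic words themselves.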

For Arabic, you can use the Stanford POS tagger, which ships with an Arabic model. This works for me to get the Stanford tools working in Python on Ubuntu 14.04.1:

$ cd ~
$ wget http://nlp.stanford.edu/software/stanford-postagger-full-2015-01-29.zip
$ unzip stanford-postagger-full-2015-01-29.zip
$ wget http://nlp.stanford.edu/software/stanford-segmenter-2015-01-29.zip
$ unzip stanford-segmenter-2015-01-29.zip
$ python
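
Before pointing NLTK at the unzipped files, it can help to confirm that the model and jar are where you expect, since the directory name depends on the archive you downloaded. This is a minimal sketch: missing_files is a hypothetical helper, and the two paths are placeholders to adjust, not locations taken from a real install.

```python
import os


def missing_files(*paths):
    """Return the subset of paths that do not point to an existing file."""
    return [p for p in paths if not os.path.isfile(os.path.expanduser(p))]


# Placeholder locations; adjust to the directory that unzip actually created.
model = "~/stanford-postagger-full-2015-01-29/models/arabic.tagger"
jar = "~/stanford-postagger-full-2015-01-29/stanford-postagger-3.5.1.jar"

for path in missing_files(model, jar):
    print("Not found:", path)
```

If nothing is printed, both files exist and the paths can be passed to the tagger.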

Then:

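# Note: newer NLTK releases renamed this class to StanfordPOSTagger.
from nltk.tag.stanford import POSTagger

# Adjust both paths to the directory that unzip actually created.
path_to_model = '/home/alvas/stanford-postagger-full-2015-01-30/models/arabic.tagger'
path_to_jar = '/home/alvas/stanford-postagger-full-2015-01-30/stanford-postagger-3.5.1.jar'

artagger = POSTagger(path_to_model, path_to_jar, encoding='utf8')
artagger._SEPARATOR = '/'
# tag() expects a list of tokens; a plain string is consumed character by character.
tagged_sent = artagger.tag(u"أنا تسلق شجرة")
print(tagged_sent)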

[out]:

$ python3 test.py
[('أ', 'NN'), ('ن', 'NN'), ('ا', 'NN'), ('ت', 'NN'), ('س', 'RP'), ('ل', 'IN'), ('ق', 'NN'), ('ش', 'NN'), ('ج', 'NN'), ('ر', 'NN'), ('ة', 'PRP')]
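
The output above has one entry per letter rather than per word, which suggests the sentence was consumed character by character. Taggers in NLTK take a list of tokens, so passing a plain string makes each character a token. The sketch below uses a stand-in function (fake_tag, hypothetical, no Stanford tools needed) to show the difference; splitting on whitespace is only a naive assumption here, and a proper Arabic segmenter would likely do better.

```python
text = u"أنا تسلق شجرة"


def fake_tag(tokens):
    """Stand-in for a tagger: pairs every non-space element with a dummy tag."""
    return [(tok, "X") for tok in tokens if not tok.isspace()]


as_string = fake_tag(text)          # iterating a str yields single characters
as_tokens = fake_tag(text.split())  # iterating a list yields whole words

print(len(as_string), "items when a string is passed")
print(len(as_tokens), "items when a token list is passed")
```

With the real tagger, the equivalent change would be to pass a token list (for example the result of text.split()) to artagger.tag instead of the raw string.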

If you have Java problems when using the Stanford POS tagger, see the DELPH-IN wiki: http://moin.delph-in.net/ZhongPreprocessing
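
A common first check for Java problems is whether a java executable is visible at all, since the NLTK Stanford wrappers run the tagger as a Java subprocess. The snippet below is a minimal sketch using only the standard library; find_java is a hypothetical helper name, and it only checks PATH, not the Java version.

```python
import shutil


def find_java():
    """Return the path to the `java` executable found on PATH, or None."""
    return shutil.which("java")


java_path = find_java()
if java_path is None:
    print("java was not found on PATH; install a JRE or fix PATH first")
else:
    print("java found at", java_path)
```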
