Tokenization of Arabic words using NLTK


Problem Description

I'm using NLTK's word_tokenize to split a sentence into words.

I want to tokenize this sentence:

في_بيتنا كل شي لما تحتاجه يضيع ...ادور على شاحن فجأة يختفي ..لدرجة اني اسوي نفسي ادور شيء 

The code I'm using is:

import nltk

lex = u"في_بيتنا كل شي لما تحتاجه يضيع ...ادور على شاحن فجأة يختفي ..لدرجة اني اسوي نفسي ادور شيء"

# Tokenize the sentence and print the tokens separated by spaces
wordsArray = nltk.word_tokenize(lex)
print(" ".join(wordsArray))

The problem is that word_tokenize doesn't split the text into words. Instead, it splits it into letters, so the output is:

"ف ي _ ب ي ت ن ا ك ل ش ي ل م ا ت ح ت ا ج ه ي ض ي ع ... ا د و ر ع ل ى ش ا ح ن ف ج أ ة ي خ ت ف ي .. ل د ر ج ة ا ن ي ا س و ي ن ف س ي ا د و ر ش ي ء"

Any ideas?

What I've reached so far:

Trying the text here, it appeared to be tokenized into letters; other tokenizers, however, tokenized it correctly. Does that mean word_tokenize is for English only? Does that go for most of NLTK's functions?

Recommended Answer

I always recommend using nltk.tokenize.wordpunct_tokenize. You can try out many of the NLTK tokenizers at http://text-processing.com/demo/tokenize/ and see for yourself.
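
As a minimal sketch of this recommendation (using the sentence from the question), wordpunct_tokenize is a simple regex-based tokenizer, so it keeps runs of Arabic word characters together without needing any language-specific model:

from nltk.tokenize import wordpunct_tokenize

lex = u"في_بيتنا كل شي لما تحتاجه يضيع ...ادور على شاحن فجأة يختفي ..لدرجة اني اسوي نفسي ادور شيء"

# wordpunct_tokenize splits on the pattern \w+|[^\w\s]+, so runs of
# word characters (including Arabic letters) stay together and
# punctuation becomes separate tokens.
print(" ".join(wordpunct_tokenize(lex)))

Under Python 3, \w matches Unicode word characters by default, which is why this pattern handles Arabic script correctly.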

