How to tokenize a Malayalam word?

Problem description

ഇതുഒരുസ്ടലംമാണ്  

This is a Unicode string meaning "this is a place".

import nltk
# Note: .decode('utf8') is Python 2 usage; in Python 3 the literal is already Unicode
nltk.wordpunct_tokenize('ഇതുഒരുസ്ഥാലമാണ് '.decode('utf8'))

does not work for me.

nltk.word_tokenize('ഇതുഒരുസ്ഥാലമാണ് '.decode('utf8'))

does not work either.

Other examples:

"കണ്ടില്ല" = കണ്ടു + ഇല്ല,
"വലിയൊരു" = വലിയ + ഒരു

Correct split:

ഇത്  ഒരു സ്ഥാലം ആണ് 

Output:

[u'\u0d07\u0d24\u0d4d\u0d12\u0d30\u0d41\u0d38\u0d4d\u0d25\u0d32\u0d02\u0d06\u0d23\u0d4d']

I just need to split the words as shown in the other example. The other-examples section is for testing. The problem is not with Unicode; it is with the morphology of the language. For this purpose you need to use a morphological analyzer.
Have a look at this paper: http://link.springer.com/chapter/10.1007%2F978-3-642-27872-3_38

Recommended answer

After a crash course on the language from Wikipedia (http://en.wikipedia.org/wiki/Malayalam), there are some issues with your question and with the tools you have requested for your desired output.

Conflated tasks

Firstly, the OP conflated the tasks of morphological analysis, segmentation, and tokenization. There is often a fine distinction between them, especially for agglutinative languages such as Turkish and Malayalam (see http://en.wikipedia.org/wiki/Agglutinative_language).

Agglutinative NLP and best practices

Next, I don't think a tokenizer is appropriate for Malayalam, an agglutinative language. For Turkish, one of the most studied agglutinative languages in NLP, researchers adopted a different strategy when it comes to "tokenization": they found that a full-blown morphological analyzer is necessary (see http://www.denizyuret.com/2006/11/turkish-resources.html, www.andrew.cmu.edu/user/ko/downloads/lrec.pdf).

Word boundaries

Tokenization is defined as the identification of linguistically meaningful units (LMU) from the surface text (see Why do I need a tokenizer for each language?), and different languages require different tokenizers to identify their word boundaries. Different people have approached the problem of finding word boundaries differently, but in summary, in NLP people have subscribed to the following:

  1. Agglutinative languages require a full-blown morphological analyzer trained with some sort of language model. There is often only a single tier when identifying what a token is, and that is at the morphemic level, hence the NLP community has developed different language models for their respective morphological analysis tools.

  2. Polysynthetic languages with specified word boundaries allow a two-tier tokenization, where the system can first identify an isolated word and then, if necessary, perform morphological analysis to obtain finer-grain tokens. A coarse-grain tokenizer can split a string using certain delimiters (e.g. NLTK's word_tokenize or punct_tokenize, which use whitespace/punctuation for English). Then, for finer-grain analysis at the morphemic level, people usually use finite-state machines to split words into morphemes (e.g. for German, http://canoo.net/services/WordformationRules/Derivation/To-N/N-To-N/Pre+Suffig.html).

  3. Polysynthetic languages without specified word boundaries often require a segmenter first, to add whitespace between the tokens, because the orthography does not differentiate word boundaries (e.g. for Chinese, https://code.google.com/p/mini-segmenter/). Then, from the delimited tokens, morphemic analysis can be done if necessary to produce finer-grain tokens (e.g. http://mecab.googlecode.com/svn/trunk/mecab/doc/index.html). Often these finer-grain tokens are tied to POS tags.
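To make the morphemic-level point concrete, here is a toy sketch (the lexicon is hand-made and hypothetical, not a real analyzer): Malayalam agglutination involves sandhi, so the surface form is not a plain concatenation of its morphemes, and no delimiter-based splitter can recover them.

```python
# Toy stand-in for a morphological analyzer. Real systems compile full
# sandhi and morphology rules into finite-state transducers; here a
# lookup table holds the two analyses given in the question.
SANDHI_LEXICON = {
    # surface form -> underlying morphemes
    "കണ്ടില്ല": ["കണ്ടു", "ഇല്ല"],
    "വലിയൊരു": ["വലിയ", "ഒരു"],
}

def analyze(word):
    # Fall back to the word itself when no analysis is known.
    return SANDHI_LEXICON.get(word, [word])

print(analyze("കണ്ടില്ല"))   # ['കണ്ടു', 'ഇല്ല']
print(analyze("വലിയൊരു"))   # ['വലിയ', 'ഒരു']
```

Note that "കണ്ടു" + "ഇല്ല" concatenated character-by-character does not equal "കണ്ടില്ല"; the vowel sign changes at the join, which is exactly why string splitting alone cannot do this job.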

The answer in brief to the OP's request/question: the OP used the wrong tools for the task:

  • To output tokens for Malayalam, a morphological analyzer is necessary; a simple coarse-grain tokenizer in NLTK will not work.
  • NLTK's tokenizers are meant to tokenize languages with specified word boundaries (e.g. English/European languages), so it is not that the tokenizer is not working for Malayalam; it just was not meant to tokenize agglutinative languages.
  • To achieve the output, a full-blown morphological analyzer needs to be built for the language, and someone has built one (aclweb.org/anthology//O/O12/O12-1028.pdf); the OP should contact the author of the paper if he/she is interested in the tool.
  • Short of building a morphological analyzer with a language model, I encourage the OP to first spot common delimiters that split words into morphemes in the language and then perform a simple re.split() to achieve a baseline tokenizer.
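That baseline can be sketched in a few lines. The delimiter set below is an assumption (whitespace plus common punctuation), not anything derived from Malayalam morphology:

```python
import re

def baseline_tokenize(text):
    # Coarse-grain baseline: split on whitespace and common punctuation.
    # This recovers space-delimited words but, by design, cannot see
    # morpheme boundaries inside an agglutinated word.
    return [tok for tok in re.split(r"[\s.,!?;:]+", text) if tok]

# The desired segmentation, already space-delimited, splits fine:
print(baseline_tokenize("ഇത്  ഒരു സ്ഥാലം ആണ് "))   # 4 tokens

# The agglutinated form from the question stays as a single token:
print(baseline_tokenize("ഇതുഒരുസ്ഥാലമാണ്"))        # 1 token
```

This makes the gap explicit: a delimiter-based tokenizer gives a usable baseline for running text, but the single-token result on the agglutinated form is where a morphological analyzer has to take over.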
