Why do I need a tokenizer for each language?


Problem Description

When processing text, why would one need a tokenizer specialized for the language?

Wouldn't tokenizing by whitespace be enough? In which cases is it not a good idea to use simple whitespace tokenization?

Solution



Tokenization is the identification of linguistically meaningful units (LMU) from the surface text.

Chinese: 如果您在新加坡只能前往一间夜间娱乐场所,Zouk必然是您的不二之选。

English: If you only have time for one club in Singapore, then it simply has to be Zouk.

Indonesian: Jika Anda hanya memiliki waktu untuk satu klub di Singapura, pergilah ke Zouk.

Japanese: シンガポールで一つしかクラブに行く時間がなかったとしたら、このズークに行くべきです。

Korean: 싱가포르에서 클럽 한 군데밖에 갈시간이 없다면, Zouk를 선택하세요.

Vietnamese: Nếu bạn chỉ có thời gian ghé thăm một câu lạc bộ ở Singapore thì hãy đến Zouk.

Text Source: http://aclweb.org/anthology/Y/Y11/Y11-1038.pdf

Tokenized properly, each sentence in the parallel text above becomes a sequence of LMUs rather than a raw string of characters.

For English this is simple, because each LMU is delimited/separated by whitespace. However, in other languages this might not be the case. Most romanized languages, such as Indonesian, have the same whitespace delimiter that easily identifies an LMU.
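To make the baseline concrete, here is the naive whitespace tokenization the question asks about, sketched in Python on the English sentence from the example. Note that even for English it is imperfect: punctuation stays glued to the neighbouring word.

```python
# Naive whitespace tokenization: a rough baseline for English, but it does
# not split off punctuation, and it fails outright for languages whose LMUs
# span spaces or whose scripts use no spaces at all.
sentence = "If you only have time for one club in Singapore, then it simply has to be Zouk."
tokens = sentence.split()
print(tokens)
# ['If', 'you', 'only', 'have', 'time', 'for', 'one', 'club', 'in',
#  'Singapore,', 'then', 'it', 'simply', 'has', 'to', 'be', 'Zouk.']
```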

However, sometimes an LMU is a combination of two "words" separated by spaces. E.g. in the Vietnamese sentence above, you have to read thời_gian (it means time in English) as one token and not two. Separating it into two tokens yields no LMU (e.g. http://vdict.com/th%E1%BB%9Di,2,0,0.html) or the wrong LMU(s) (e.g. http://vdict.com/gian,2,0,0.html). Hence a proper Vietnamese tokenizer would output thời_gian as one token rather than thời and gian.
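As a minimal sketch, a compound-aware tokenizer could merge such pairs against a lexicon of multi-word LMUs. The lexicon and function below are toy illustrations (a real Vietnamese tokenizer uses a large dictionary plus statistical disambiguation); a fuller lexicon would also list e.g. câu_lạc_bộ ("club").

```python
# Toy lexicon of multi-word LMUs; purely illustrative.
COMPOUNDS = {("thời", "gian")}  # thời_gian = "time"

def tokenize_vi(sentence):
    """Whitespace-split, then merge adjacent pairs listed as one LMU."""
    words = sentence.split()
    tokens, i = [], 0
    while i < len(words):
        if i + 1 < len(words) and (words[i], words[i + 1]) in COMPOUNDS:
            tokens.append(words[i] + "_" + words[i + 1])
            i += 2
        else:
            tokens.append(words[i])
            i += 1
    return tokens

print(tokenize_vi("Nếu bạn chỉ có thời gian ghé thăm một câu lạc bộ"))
# ['Nếu', 'bạn', 'chỉ', 'có', 'thời_gian', 'ghé', 'thăm', 'một', 'câu', 'lạc', 'bộ']
```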

For some other languages, the orthography might have no spaces to delimit "words" or "tokens", e.g. Chinese, Japanese and sometimes Korean. In that case, tokenization is necessary for a computer to identify LMUs. Often there are morphemes/inflections attached to an LMU, so sometimes a morphological analyzer is more useful than a tokenizer in Natural Language Processing.
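For such space-free scripts, one classic (if simplistic) technique is dictionary-based forward maximum matching: at each position, take the longest dictionary word that matches. Below is a minimal sketch with a toy dictionary; real segmenters (e.g. jieba for Chinese, or the morphological analyzer MeCab for Japanese) rely on large dictionaries plus statistical models to resolve the ambiguities this greedy rule gets wrong.

```python
# Toy dictionary for forward maximum matching; purely illustrative.
DICT = {"如果", "您", "只有", "时间", "去", "新加坡", "的", "一个", "俱乐部"}
MAX_LEN = max(len(w) for w in DICT)

def max_match(text):
    """Greedily take the longest dictionary word at each position."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + MAX_LEN), i, -1):
            if text[i:j] in DICT:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # Out-of-vocabulary character: fall back to a single-char token.
            tokens.append(text[i])
            i += 1
    return tokens

print(max_match("如果您只有时间去新加坡的一个俱乐部"))
# ['如果', '您', '只有', '时间', '去', '新加坡', '的', '一个', '俱乐部']
```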
