A Viable Solution for Word Splitting Khmer?


Question


I am working on a solution to split long lines of Khmer (the Cambodian language) into individual words (in UTF-8). Khmer does not use spaces between words. There are a few solutions out there, but they are far from adequate (here and here), and those projects have fallen by the wayside.


Here is a sample line of Khmer that needs to be split (they can be longer than this):


ចូរសរសើរដល់ទ្រង់ដែលទ្រង់បានប្រទានការទាំងអស់នោះមកដល់រូបអ្នកដោយព្រោះអង្គព្រះយេស៊ូវ ហើយដែលអ្នកមិនអាចរកការទាំងអស់នោះដោយសារការប្រព្រឹត្តរបស់អ្នកឡើយ។


The goal of creating a viable solution that splits Khmer words is twofold: it will encourage those who use legacy (non-Unicode) Khmer fonts to convert to Unicode (which has many benefits), and it will enable legacy Khmer text to be imported into Unicode and run through a spelling checker quickly (rather than manually going through and splitting words, which, for a large document, can take a very long time).


I don't need 100% accuracy, but speed is important (especially since the lines that need to be split into Khmer words can be quite long). I am open to suggestions, but currently I have a large corpus of Khmer words that are correctly split (with a non-breaking space), and I have created a word-probability dictionary file (frequency.csv) to use as the dictionary for the word splitter.
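As a minimal sketch of how such a frequency file could feed the splitter (assuming frequency.csv holds one `word,count` pair per line — the actual format isn't shown in the question), the dictionary and probability lookup might be built like this:

```python
import csv
import io

# Hypothetical stand-in for frequency.csv; the "word,count" layout
# is an assumption, not the question's actual file format.
sample_csv = io.StringIO("ចូរ,120\nសរសើរ,45\nដល់,300\n")

dictionary = {}
for word, count in csv.reader(sample_csv):
    dictionary[word] = int(count)

total = float(sum(dictionary.values()))
max_word_length = max(len(w) for w in dictionary)

def word_prob(word):
    """Relative frequency of `word`; 0.0 for out-of-vocabulary strings."""
    return dictionary.get(word, 0) / total

print(word_prob('ដល់'))
```

With a real frequency.csv, `sample_csv` would be replaced by `open('frequency.csv', encoding='utf-8')`.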


I found this Python code here that uses the Viterbi algorithm, and it supposedly runs fast.

import re
from itertools import groupby

def viterbi_segment(text):
    # probs[i] = probability of the best segmentation of text[:i];
    # lasts[i] = start index of the final word in that segmentation.
    probs, lasts = [1.0], [0]
    for i in range(1, len(text) + 1):
        prob_k, k = max((probs[j] * word_prob(text[j:i]), j)
                        for j in range(max(0, i - max_word_length), i))
        probs.append(prob_k)
        lasts.append(k)
    # Walk the back-pointers to recover the word sequence.
    words = []
    i = len(text)
    while 0 < i:
        words.append(text[lasts[i]:i])
        i = lasts[i]
    words.reverse()
    return words, probs[-1]

def word_prob(word): return dictionary.get(word, 0) / total
def words(text): return re.findall('[a-z]+', text.lower())

# Build word counts from a plain-text corpus (big.txt).
dictionary = dict((w, len(list(ws)))
                  for w, ws in groupby(sorted(words(open('big.txt').read()))))
max_word_length = max(map(len, dictionary))
total = float(sum(dictionary.values()))
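To see how the Viterbi segmenter behaves, here is a self-contained toy run of the same algorithm, with the corpus-derived globals replaced by a made-up three-word English dictionary (a stand-in for the Khmer frequency data):

```python
def viterbi_segment(text, dictionary, total, max_word_length):
    # probs[i] = probability of the best segmentation of text[:i];
    # lasts[i] = start index of the final word in that segmentation.
    probs, lasts = [1.0], [0]
    for i in range(1, len(text) + 1):
        prob_k, k = max(
            (probs[j] * dictionary.get(text[j:i], 0) / total, j)
            for j in range(max(0, i - max_word_length), i))
        probs.append(prob_k)
        lasts.append(k)
    # Walk the back-pointers to recover the word sequence.
    words, i = [], len(text)
    while i > 0:
        words.append(text[lasts[i]:i])
        i = lasts[i]
    words.reverse()
    return words, probs[-1]

# Toy dictionary with made-up counts (stand-in for frequency.csv).
counts = {'the': 1, 'man': 1, 'ran': 1}
total = float(sum(counts.values()))
max_word_length = max(map(len, counts))

words, prob = viterbi_segment('themanran', counts, total, max_word_length)
print(words)   # ['the', 'man', 'ran']
```

Each word here has probability 1/3, so the returned probability for the three-word split is (1/3)³ = 1/27.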


I also tried using the source Java code from the author of this page: Text segmentation: dictionary-based word splitting, but it ran too slowly to be of any use (because my word-probability dictionary has over 100k terms...).

And here is the Python code from Detect most likely words from text without spaces / combined words:

WORD_FREQUENCIES = {
    'file': 0.00123,
    'files': 0.00124,
    'save': 0.002,
    'ave': 0.00001,
    'as': 0.00555
}

def split_text(text, word_frequencies, cache):
    # Memoise on the remaining text so each suffix is solved only once.
    if text in cache:
        return cache[text]
    if not text:
        return 1, []
    best_freq, best_split = 0, []
    # Try every prefix that is a known word, then recurse on the rest.
    for i in range(1, len(text) + 1):
        word, remainder = text[:i], text[i:]
        freq = word_frequencies.get(word)
        if freq:
            remainder_freq, remainder_split = split_text(
                    remainder, word_frequencies, cache)
            freq *= remainder_freq
            if freq > best_freq:
                best_freq = freq
                best_split = [word] + remainder_split
    cache[text] = (best_freq, best_split)
    return cache[text]

print(split_text('filesaveas', WORD_FREQUENCIES, {}))

--> (1.3653e-08, ['file', 'save', 'as'])
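The explicit `cache` dict above can also be expressed with the standard library's `functools.lru_cache`, which memoises on the remaining text automatically; a minimal sketch of that variant (using the same made-up frequencies):

```python
from functools import lru_cache

WORD_FREQUENCIES = {
    'file': 0.00123,
    'files': 0.00124,
    'save': 0.002,
    'ave': 0.00001,
    'as': 0.00555,
}

def make_splitter(word_frequencies):
    # lru_cache plays the role of the explicit cache dict:
    # each suffix of the input is solved at most once.
    @lru_cache(maxsize=None)
    def split(text):
        if not text:
            return 1.0, ()
        best_freq, best_split = 0.0, ()
        for i in range(1, len(text) + 1):
            freq = word_frequencies.get(text[:i])
            if freq:
                rest_freq, rest_split = split(text[i:])
                if freq * rest_freq > best_freq:
                    best_freq = freq * rest_freq
                    best_split = (text[:i],) + rest_split
        return best_freq, best_split
    return split

split = make_splitter(WORD_FREQUENCIES)
freq, words = split('filesaveas')
print(words)   # ('file', 'save', 'as')
```

Tuples are used instead of lists so the cached values are immutable and safe to share between calls.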


I am a newbie when it comes to Python, and I am really new to all real programming (outside of websites), so please bear with me. Does anyone have any options that they feel would work well?

Answer


The ICU library (which has Python and Java bindings) has a DictionaryBasedBreakIterator class that can be used for this.
