A Viable Solution for Word Splitting Khmer?


Question

I am working on a solution to split long lines of Khmer (the Cambodian language) into individual words (in UTF-8). Khmer does not use spaces between words. There are a few solutions out there, but they are far from adequate (here and here), and those projects have fallen by the wayside.

Here is a sample line of Khmer that needs to be split (they can be longer than this):

ចូរសរសើរដល់ទ្រង់ដែលទ្រង់បានប្រទានការទាំងអស់នោះមកដល់រូបអ្នកដោយព្រោះអង្គព្រះយេស៊ូវ ហើយដែលអ្នកមិនអាចរកការទាំងអស់នោះដោយសារការប្រព្រឹត្តរបស់អ្នកឡើយ។

The goal of creating a viable solution that splits Khmer words is twofold: it would encourage those who use Khmer legacy (non-Unicode) fonts to convert to Unicode (which has many benefits), and it would allow legacy Khmer fonts to be imported into Unicode quickly for use with a spelling checker (rather than manually going through and splitting words, which, with a large document, can take a very long time).

I don't need 100% accuracy, but speed is important (especially since the lines that need to be split into Khmer words can be quite long). I am open to suggestions, but currently I have a large corpus of Khmer words that are correctly split (with a non-breaking space), and I have created a word-probability dictionary file (frequency.csv) to use as the dictionary for the word splitter.
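Any of the segmenters below needs that frequency.csv turned into a word-to-probability table first. A minimal sketch, assuming (hypothetically) that frequency.csv is UTF-8 with one `word,count` row per line — the actual column layout of the poster's file is not specified:

```python
import csv

def load_word_probs(path):
    """Build a word -> probability dict from a word,count CSV file."""
    counts = {}
    with open(path, encoding='utf-8') as f:
        for word, count in csv.reader(f):
            counts[word] = counts.get(word, 0.0) + float(count)
    total = sum(counts.values())
    # Normalize raw counts into probabilities.
    return {w: c / total for w, c in counts.items()}
```

The resulting dict can be dropped in wherever the code below expects `dictionary` or `word_frequencies`.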

I found this Python code here that uses the Viterbi algorithm, and it supposedly runs fast.

import re
from itertools import groupby

def viterbi_segment(text):
    # probs[i] holds the probability of the best segmentation of text[:i];
    # lasts[i] holds the start index of the last word in that segmentation.
    probs, lasts = [1.0], [0]
    for i in range(1, len(text) + 1):
        prob_k, k = max((probs[j] * word_prob(text[j:i]), j)
                        for j in range(max(0, i - max_word_length), i))
        probs.append(prob_k)
        lasts.append(k)
    # Walk the back-pointers to recover the word sequence.
    words = []
    i = len(text)
    while 0 < i:
        words.append(text[lasts[i]:i])
        i = lasts[i]
    words.reverse()
    return words, probs[-1]

def word_prob(word): return dictionary.get(word, 0) / total
def words(text): return re.findall('[a-z]+', text.lower())

# Count word occurrences in a training corpus (big.txt is English;
# the [a-z]+ regex would have to change for Khmer).
dictionary = dict((w, len(list(ws)))
                  for w, ws in groupby(sorted(words(open('big.txt').read()))))
max_word_length = max(map(len, dictionary))
total = float(sum(dictionary.values()))
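Note that the corpus-building part above (`big.txt`, the `[a-z]+` regex) is English-specific; the dynamic program itself works for any probability table. A self-contained sanity check with a toy English table (the real input would be the frequency.csv probabilities):

```python
def viterbi_segment(text, word_prob, max_word_length):
    # probs[i] = probability of the best segmentation of text[:i];
    # lasts[i] = start index of the last word in that segmentation.
    probs, lasts = [1.0], [0]
    for i in range(1, len(text) + 1):
        prob_k, k = max((probs[j] * word_prob(text[j:i]), j)
                        for j in range(max(0, i - max_word_length), i))
        probs.append(prob_k)
        lasts.append(k)
    words, i = [], len(text)
    while i > 0:                       # follow the back-pointers
        words.append(text[lasts[i]:i])
        i = lasts[i]
    words.reverse()
    return words, probs[-1]

# Toy probability table for illustration only.
freq = {'the': 0.05, 'cat': 0.01, 'sat': 0.01, 'heca': 0.0001}
total = sum(freq.values())
prob = lambda w: freq.get(w, 0.0) / total
words, p = viterbi_segment('thecatsat', prob, max(map(len, freq)))
```

`words` comes back as `['the', 'cat', 'sat']`. Each position is scored in O(max_word_length) candidate splits, so long lines stay linear-time in practice.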

I also tried using the source Java code from the author of this page: Text segmentation: dictionary-based word splitting, but it ran too slow to be of any use (because my word-probability dictionary has over 100k terms...).

Here is another option, from Detect most likely words from text without spaces / combined words:

WORD_FREQUENCIES = {
    'file': 0.00123,
    'files': 0.00124,
    'save': 0.002,
    'ave': 0.00001,
    'as': 0.00555
}

def split_text(text, word_frequencies, cache):
    # Memoized recursive search for the highest-probability split.
    if text in cache:
        return cache[text]
    if not text:
        return 1, []
    best_freq, best_split = 0, []
    for i in range(1, len(text) + 1):  # range, not Python 2's xrange
        word, remainder = text[:i], text[i:]
        freq = word_frequencies.get(word, None)
        if freq:
            remainder_freq, remainder_split = split_text(
                    remainder, word_frequencies, cache)
            freq *= remainder_freq
            if freq > best_freq:
                best_freq = freq
                best_split = [word] + remainder_split
    cache[text] = (best_freq, best_split)
    return cache[text]

print(split_text('filesaveas', WORD_FREQUENCIES, {}))

--> (1.3653e-08, ['file', 'save', 'as'])
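One caveat with the recursive version: it recurses once per remaining suffix, so on the very long Khmer lines described above it can hit Python's default recursion limit. The same search can be written as an iterative left-to-right dynamic program; a sketch (function name is my own):

```python
def split_text_iterative(text, word_probs):
    """Iterative DP: best[i] = (probability, split) for text[:i]."""
    n = len(text)
    max_len = max(map(len, word_probs))
    best = [(0.0, [])] * (n + 1)
    best[0] = (1.0, [])
    for i in range(1, n + 1):
        # Only look back at most max_len characters for the last word.
        for j in range(max(0, i - max_len), i):
            p_prefix, split = best[j]
            p = p_prefix * word_probs.get(text[j:i], 0.0)
            if p > best[i][0]:
                best[i] = (p, split + [text[j:i]])
    return best[n]

WORD_FREQS = {'file': 0.00123, 'files': 0.00124, 'save': 0.002,
              'ave': 0.00001, 'as': 0.00555}
best_p, best_split = split_text_iterative('filesaveas', WORD_FREQS)
```

This reproduces the `['file', 'save', 'as']` split without any recursion depth concerns.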

I am a newbie when it comes to Python, and I am really new to all real programming (outside of websites), so please bear with me. Does anyone have any options that they feel would work well?

Answer

The ICU library (which has Python and Java bindings) has a DictionaryBasedBreakIterator class that can be used for this.
