A Viable Solution for Word Splitting Khmer?


Question

I am working on a solution to split long lines of Khmer (the Cambodian language) into individual words (in UTF-8). Khmer does not use spaces between words. There are a few solutions out there, but they are far from adequate (here and here), and those projects have fallen by the wayside.

Here is a sample line of Khmer that needs to be split (they can be longer than this):

ចូរសរសើរដល់ទ្រង់ដែលទ្រង់បានប្រទានការទាំងអស់នោះមកដល់រូបអ្នកដោយព្រោះអង្គព្រះយេស៊ូវ ហើយដែលអ្នកមិនអាចរកការទាំងអស់នោះដោយសារការប្រព្រឹត្តរបស់អ្នកឡើយ។

The goal of creating a viable solution that splits Khmer words is twofold: it would encourage those who use Khmer legacy (non-Unicode) fonts to convert to Unicode (which has many benefits), and it would allow legacy Khmer fonts to be imported into Unicode quickly for use with a spelling checker (rather than manually going through and splitting words, which, with a large document, can take a very long time).

I don't need 100% accuracy, but speed is important (especially since the lines that need to be split into Khmer words can be quite long). I am open to suggestions, but currently I have a large corpus of Khmer words that are correctly split (with a non-breaking space), and I have created a word-probability dictionary file (frequency.csv) to use as the dictionary for the word splitter.
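Any of the segmenters below needs that frequency.csv turned into a word-to-probability table first. A minimal sketch, assuming (hypothetically) that frequency.csv is UTF-8 with one `word,count` row per line — the actual column layout of the poster's file is not specified:

```python
import csv

def load_word_probs(path):
    """Build a word -> probability dict from a word,count CSV file."""
    counts = {}
    with open(path, encoding='utf-8') as f:
        for word, count in csv.reader(f):
            counts[word] = counts.get(word, 0.0) + float(count)
    total = sum(counts.values())
    # Normalize raw counts into probabilities.
    return {w: c / total for w, c in counts.items()}
```

The resulting dict can be dropped in wherever the code below expects `dictionary` or `word_frequencies`.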

I found this Python code here that uses the Viterbi algorithm, and it supposedly runs fast.

import re
from itertools import groupby

def viterbi_segment(text):
    # probs[i] holds the probability of the best segmentation of text[:i];
    # lasts[i] holds the start index of the last word in that segmentation.
    probs, lasts = [1.0], [0]
    for i in range(1, len(text) + 1):
        prob_k, k = max((probs[j] * word_prob(text[j:i]), j)
                        for j in range(max(0, i - max_word_length), i))
        probs.append(prob_k)
        lasts.append(k)
    # Walk the back-pointers to recover the word sequence.
    words = []
    i = len(text)
    while 0 < i:
        words.append(text[lasts[i]:i])
        i = lasts[i]
    words.reverse()
    return words, probs[-1]

def word_prob(word): return dictionary.get(word, 0) / total
def words(text): return re.findall('[a-z]+', text.lower())

# Count word occurrences in a training corpus (big.txt is English;
# the [a-z]+ regex would have to change for Khmer).
dictionary = dict((w, len(list(ws)))
                  for w, ws in groupby(sorted(words(open('big.txt').read()))))
max_word_length = max(map(len, dictionary))
total = float(sum(dictionary.values()))
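Note that the corpus-building part above (`big.txt`, the `[a-z]+` regex) is English-specific; the dynamic program itself works for any probability table. A self-contained sanity check with a toy English table (the real input would be the frequency.csv probabilities):

```python
def viterbi_segment(text, word_prob, max_word_length):
    # probs[i] = probability of the best segmentation of text[:i];
    # lasts[i] = start index of the last word in that segmentation.
    probs, lasts = [1.0], [0]
    for i in range(1, len(text) + 1):
        prob_k, k = max((probs[j] * word_prob(text[j:i]), j)
                        for j in range(max(0, i - max_word_length), i))
        probs.append(prob_k)
        lasts.append(k)
    words, i = [], len(text)
    while i > 0:                       # follow the back-pointers
        words.append(text[lasts[i]:i])
        i = lasts[i]
    words.reverse()
    return words, probs[-1]

# Toy probability table for illustration only.
freq = {'the': 0.05, 'cat': 0.01, 'sat': 0.01, 'heca': 0.0001}
total = sum(freq.values())
prob = lambda w: freq.get(w, 0.0) / total
words, p = viterbi_segment('thecatsat', prob, max(map(len, freq)))
```

`words` comes back as `['the', 'cat', 'sat']`. Each position is scored in O(max_word_length) candidate splits, so long lines stay linear-time in practice.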

I also tried using the source Java code from the author of this page: Text segmentation: dictionary-based word splitting, but it ran too slow to be of any use (because my word-probability dictionary has over 100k terms...).

Here is another option, from Detect most likely words from text without spaces / combined words:

WORD_FREQUENCIES = {
    'file': 0.00123,
    'files': 0.00124,
    'save': 0.002,
    'ave': 0.00001,
    'as': 0.00555
}

def split_text(text, word_frequencies, cache):
    # Memoized recursive search for the highest-probability split.
    if text in cache:
        return cache[text]
    if not text:
        return 1, []
    best_freq, best_split = 0, []
    for i in range(1, len(text) + 1):  # range, not Python 2's xrange
        word, remainder = text[:i], text[i:]
        freq = word_frequencies.get(word, None)
        if freq:
            remainder_freq, remainder_split = split_text(
                    remainder, word_frequencies, cache)
            freq *= remainder_freq
            if freq > best_freq:
                best_freq = freq
                best_split = [word] + remainder_split
    cache[text] = (best_freq, best_split)
    return cache[text]

print(split_text('filesaveas', WORD_FREQUENCIES, {}))

--> (1.3653e-08, ['file', 'save', 'as'])
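One caveat with the recursive version: it recurses once per remaining suffix, so on the very long Khmer lines described above it can hit Python's default recursion limit. The same search can be written as an iterative left-to-right dynamic program; a sketch (function name is my own):

```python
def split_text_iterative(text, word_probs):
    """Iterative DP: best[i] = (probability, split) for text[:i]."""
    n = len(text)
    max_len = max(map(len, word_probs))
    best = [(0.0, [])] * (n + 1)
    best[0] = (1.0, [])
    for i in range(1, n + 1):
        # Only look back at most max_len characters for the last word.
        for j in range(max(0, i - max_len), i):
            p_prefix, split = best[j]
            p = p_prefix * word_probs.get(text[j:i], 0.0)
            if p > best[i][0]:
                best[i] = (p, split + [text[j:i]])
    return best[n]

WORD_FREQS = {'file': 0.00123, 'files': 0.00124, 'save': 0.002,
              'ave': 0.00001, 'as': 0.00555}
best_p, best_split = split_text_iterative('filesaveas', WORD_FREQS)
```

This reproduces the `['file', 'save', 'as']` split without any recursion depth concerns.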

I am a newbie when it comes to Python, and I am really new to all real programming (outside of websites), so please bear with me. Does anyone have any options that they feel would work well?

Answer

The ICU library (which has Python and Java bindings) has a DictionaryBasedBreakIterator class that can be used for this.
