Transliterate CJK to Latin -- preferably in C++


Question

I am trying to write a program that can transliterate CJK text to Latin script (Pinyin, Romaji, etc.). For example, you give it a Chinese, Japanese or Korean document as input and get a version transliterated into Latin script as output.

I am new to this field, so please bear with me.

Obviously, I first need to detect the language (Chinese, Japanese or Korean) before going any further. Then, as far as I understand, in order to transliterate I need to divide the text into words, since these languages do not put spaces between words. This is called word segmentation. Finally, once the words are found, I need to transliterate them into Latin script.
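As a rough illustration of the first step above, language detection can be approximated by counting characters per Unicode block. This is only a sketch under stated assumptions: the input is well-formed UTF-8, and the names CjkGuess and guess_cjk_language are hypothetical, not taken from any library. It is a heuristic, not real language identification: Japanese text written only in kanji would be reported as Chinese.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical labels for this sketch; not part of any library.
enum class CjkGuess { Unknown, Chinese, Japanese, Korean };

// Decode one UTF-8 sequence starting at s[i] and advance i past it.
// Minimal decoder: assumes well-formed input and performs no validation.
static char32_t next_code_point(const std::string& s, std::size_t& i) {
    unsigned char b0 = static_cast<unsigned char>(s[i++]);
    if (b0 < 0x80) return b0;  // ASCII
    int extra = (b0 >= 0xF0) ? 3 : (b0 >= 0xE0) ? 2 : 1;
    char32_t cp = b0 & (0x3F >> extra);  // payload bits of the lead byte
    while (extra-- > 0 && i < s.size()) {
        cp = (cp << 6) | (static_cast<unsigned char>(s[i++]) & 0x3F);
    }
    return cp;
}

// Guess the language from the Unicode blocks that occur in the text.
CjkGuess guess_cjk_language(const std::string& utf8) {
    std::size_t hangul = 0, kana = 0, han = 0;
    for (std::size_t i = 0; i < utf8.size();) {
        char32_t cp = next_code_point(utf8, i);
        if ((cp >= 0xAC00 && cp <= 0xD7A3) ||    // Hangul Syllables
            (cp >= 0x1100 && cp <= 0x11FF)) {    // Hangul Jamo
            ++hangul;
        } else if (cp >= 0x3040 && cp <= 0x30FF) {  // Hiragana + Katakana
            ++kana;
        } else if ((cp >= 0x4E00 && cp <= 0x9FFF) ||  // CJK Unified Ideographs
                   (cp >= 0x3400 && cp <= 0x4DBF)) {  // Extension A
            ++han;
        }
    }
    // Hangul and kana are specific to Korean and Japanese; Han characters
    // are shared by all three languages, so they are checked last.
    if (hangul > 0) return CjkGuess::Korean;
    if (kana > 0) return CjkGuess::Japanese;
    if (han > 0) return CjkGuess::Chinese;
    return CjkGuess::Unknown;
}
```

Checking the language-specific scripts first is a deliberate choice: Han characters alone cannot separate the three languages, so they only decide the result when nothing else is present.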

Here is my question:

  1. There are some libraries that do transliteration. Since I am looking for open source ones in C/C++, I found Adson (Chinese only) and ICU4C. The Git repository I cloned for Adson did not compile, and I was not able to find a simple, straightforward tutorial for ICU4C. Where can I find a tutorial on using ICU4C? Do you know of any other library that transliterates CJK to Latin? If its accuracy is high enough (around 90%), I can live without it being written in C/C++.

Recommended answer

ICU: there are examples in http://userguide.icu-project.org/transforms/general, and ICU 50 now has CJK word segmentation. The uconv sample can be run with something like uconv -f utf-8 -t utf-8 -x 'Any-Latin' to apply the Any-Latin transform. That does not take the language into account, though.
