Word-Counter in some hieroglyphics languages?


Question

Is there any library available for word counting in some hieroglyphic languages (e.g. Chinese, Japanese, Korean...)?

I found that MS Word counts text in these languages effectively. Can I add a reference to the MS Word libraries in my .NET application to implement this function?

Or are there any other solutions that achieve this?

Recommended answer

"Is there any available library for word-counting of some hieroglyphics language (ex: chinese, japanese, korean...)?"

Hieroglyphics? No, they're not. They're logographic characters, and the difference is not a subtle one. I'm sure a native speaker could explain this much better than I can.

Japanese and Chinese text is made of characters, exactly as in Western languages, but a single character may also be a word. Moreover, these languages don't use spaces to separate words, so you can't distinguish characters from words by using blanks as delimiters.

What Word does is count characters (in effect assuming each one is a word), and you can do the same in your code by counting characters. Just don't forget the text is Unicode, so you can't count bytes. To count real words you need a dictionary, because you can't rely on spaces.
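To make the difference between characters and bytes concrete, here is a minimal C# sketch. The class and method names (`CharacterCount`, `CodeUnits`, `Utf8Bytes`, `TextElements`) are made up for illustration; the .NET calls they wrap (`string.Length`, `Encoding.UTF8.GetByteCount`, `StringInfo.LengthInTextElements`) are standard. The second sample is a rare ideograph outside the Basic Multilingual Plane, included only to show that `Length` counts UTF-16 code units rather than characters.

```csharp
using System;
using System.Globalization;
using System.Text;

static class CharacterCount
{
    // UTF-16 code units; equals the character count only for BMP text.
    public static int CodeUnits(string text) => text.Length;

    // Bytes needed to store the text as UTF-8.
    public static int Utf8Bytes(string text) => Encoding.UTF8.GetByteCount(text);

    // Text elements: a surrogate pair or a combining sequence counts once.
    public static int TextElements(string text) => new StringInfo(text).LengthInTextElements;

    static void Main()
    {
        string chinese = "这是一个示例文本";
        Console.WriteLine(CodeUnits(chinese));    // 8
        Console.WriteLine(Utf8Bytes(chinese));    // 24
        Console.WriteLine(TextElements(chinese)); // 8

        string rare = "\U00020BB7"; // one ideograph stored as a surrogate pair
        Console.WriteLine(CodeUnits(rare));       // 2
        Console.WriteLine(TextElements(rare));    // 1
    }
}
```

For ordinary Chinese and Japanese text the first and third numbers agree, so `Length` is usually good enough; `TextElements` is the safer choice if rare characters may appear.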

For example, take these two strings:

这是一个示例文本

これは、サンプルのテキストです

These will be counted as 8 characters and 8 words for the Chinese string, and as 15 characters and 15 words for the Japanese one. That is not the real word count (for example, the Japanese sentence is 5 words when transliterated into romaji). Also, don't forget that Japanese uses more than one script, and the kana scripts (hiragana and katakana) are phonetic.
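As a rough illustration of why real word counting needs a dictionary, here is a sketch of greedy forward maximum matching, one simple dictionary-based segmentation technique. It is not what Word or any particular library does. The `DictionarySegmenter` class and the three-entry dictionary are hypothetical; a real segmenter needs a large dictionary and a better strategy for ambiguous cases.

```csharp
using System;
using System.Collections.Generic;

static class DictionarySegmenter
{
    // Greedy forward maximum matching: at each position take the longest
    // dictionary entry that matches, falling back to a single character.
    // Works on UTF-16 code units, so it assumes BMP-only text.
    public static List<string> Segment(string text, ISet<string> dictionary, int maxWordLength)
    {
        var words = new List<string>();
        int i = 0;
        while (i < text.Length)
        {
            int length = Math.Min(maxWordLength, text.Length - i);
            while (length > 1 && !dictionary.Contains(text.Substring(i, length)))
            {
                length--;
            }
            words.Add(text.Substring(i, length));
            i += length;
        }
        return words;
    }

    static void Main()
    {
        // Hypothetical toy dictionary, for illustration only.
        var dictionary = new HashSet<string> { "一个", "示例", "文本" };
        List<string> words = Segment("这是一个示例文本", dictionary, 2);
        Console.WriteLine(string.Join(" / ", words)); // 这 / 是 / 一个 / 示例 / 文本
        Console.WriteLine(words.Count);               // 5
    }
}
```

With this toy dictionary the 8-character Chinese sample comes out as 5 words; a different dictionary could split it differently, which is exactly why the count depends on the resource you choose.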

So what's the point? What will you count? Words transliterated into one of the phonetic representations (in Latin characters) that we use for these languages? Which one? The word count would be quite different, and it would really be counting our own concept of a word (which is, I suppose, why Word counts characters).

That said, now try writing this code:

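// MessageBox lives in System.Windows.Forms (add: using System.Windows.Forms;)
string text = "这是一个示例文本";
// Length counts UTF-16 code units, which equals the character count for this string.
MessageBox.Show(text.Length.ToString());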

It'll display 8, as Word does (we're counting characters); in bytes (assuming UTF-8 encoding) it is 24. It makes no sense to count spaces here. If you plan to count words in a transliteration, you need an external library (it's not an easy task to do yourself), and a different one for each language you want to support. Detecting the language automatically is fairly easy, because Japanese text uses hiragana and katakana characters very often. Which library? There are a lot of them. I don't know about Chinese, but for Japanese a popular tool for transliterating kanji is Kakasi.
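The language auto-detection mentioned above can be sketched by checking Unicode block ranges. This is only a heuristic under stated assumptions: the `ScriptGuess` class and its labels are hypothetical, only the main CJK Unified Ideographs block is checked, and a Japanese text written entirely in kanji would be reported as Chinese.

```csharp
using System;

static class ScriptGuess
{
    // Hiragana is U+3040..U+309F and Katakana is U+30A0..U+30FF.
    static bool IsKana(char c) => c >= '\u3040' && c <= '\u30FF';

    // CJK Unified Ideographs, main block only: U+4E00..U+9FFF.
    static bool IsHan(char c) => c >= '\u4E00' && c <= '\u9FFF';

    // Precomposed Hangul syllables: U+AC00..U+D7A3.
    static bool IsHangul(char c) => c >= '\uAC00' && c <= '\uD7A3';

    // Returns a guess based on the first kana or Hangul character found;
    // text containing only ideographs is assumed to be Chinese.
    public static string Guess(string text)
    {
        bool sawHan = false;
        foreach (char c in text)
        {
            if (IsKana(c)) return "Japanese";
            if (IsHangul(c)) return "Korean";
            if (IsHan(c)) sawHan = true;
        }
        return sawHan ? "Chinese" : "Unknown";
    }

    static void Main()
    {
        Console.WriteLine(Guess("这是一个示例文本"));               // Chinese
        Console.WriteLine(Guess("これは、サンプルのテキストです")); // Japanese
        Console.WriteLine(Guess("hello"));                          // Unknown
    }
}
```

The result could then be used to pick which language-specific library to call.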

Korean is a completely different story. Hangul is an alphabet, just like the Latin one, but what looks like a character (and should really be called a syllable) may be composed of several letters. Again, you can't rely on spaces alone for word counting, since a space-delimited chunk usually bundles a word with its particles. It's somewhat more complicated, because here you may need a dictionary even for character counting (otherwise you'll just count syllables).
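To show the syllable versus letter distinction, here is a small sketch that counts both for precomposed Hangul syllables. It relies on the arithmetic layout of the Unicode Hangul Syllables block (U+AC00..U+D7A3), where the code point offset modulo 28 tells whether a trailing consonant is present. The `HangulCount` class is hypothetical, compound letters are counted as a single unit, and anything that is not a precomposed syllable is ignored, so treat it as an approximation of "letters".

```csharp
using System;

static class HangulCount
{
    const int SyllableBase = 0xAC00; // first precomposed Hangul syllable
    const int SyllableLast = 0xD7A3; // last precomposed Hangul syllable
    const int TrailingSlots = 28;    // trailing-consonant slots, slot 0 means "none"

    static bool IsSyllable(char c) => c >= SyllableBase && c <= SyllableLast;

    // Counts precomposed syllable blocks.
    public static int CountSyllables(string text)
    {
        int count = 0;
        foreach (char c in text)
        {
            if (IsSyllable(c)) count++;
        }
        return count;
    }

    // Each block has a leading consonant and a vowel, plus an optional
    // trailing consonant, so it contributes 2 or 3 letters.
    public static int CountLetters(string text)
    {
        int count = 0;
        foreach (char c in text)
        {
            if (!IsSyllable(c)) continue;
            int offset = c - SyllableBase;
            count += (offset % TrailingSlots == 0) ? 2 : 3;
        }
        return count;
    }

    static void Main()
    {
        string korean = "한국어";
        Console.WriteLine(CountSyllables(korean)); // 3
        Console.WriteLine(CountLetters(korean));   // 8
    }
}
```

Neither number is a word count; for that you would still need a language-specific library or dictionary, as with Chinese and Japanese.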
