How can I tell which Unicode characters are letters (parts of words) and which are punctuation marks?


Problem description

I want to detect words in text. That is, I need to know which characters in a given text are letters, meaning they can be part of a (spoken) word, and which are punctuation and the like.

For example, in the sentence above, "I", "want", "i" and "e" are words in this sense, while the spaces, the "." and the comma are not.

The difficulty is that I want to be able to read any script that Unicode covers. For example, the German word "schön" is one word. But what about Greek, Arabic or Japanese?

So what I need is a table or list specifying all ranges of characters that can form words. Optionally, I would also like to know which characters are digits that can form numbers (assuming other scripts have numbering schemes similar to the Arabic numerals).

I need this for Mac OS X, Windows and Linux. I will be writing a C app, so it needs to be either an OS library or a complete code/data solution that I could translate into C.

I know that Mac OS (Cocoa) offers functions for this purpose, but I am not sure whether there are similar solutions for Windows and Linux (probably GTK-based?).

Alternatively, I could write my own code if I had the complete tables.

I have found the Unicode charts (http://unicode.org/charts/index.html#scripts), but they do not come in a convenient form that I could use in programming.

So, can someone tell me whether there are functions for Windows and Linux for this purpose, or where I can find a complete table/list of word characters in Unicode?

Recommended answer

You can try using each character's Unicode character category (letter, number, punctuation, separator and so on) to work out which characters belong to words and which act as word separators. Be aware, though, that some languages (e.g. Japanese) do not use word separators at all.
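As a rough illustration of the category approach, here is a minimal sketch in Python, whose standard `unicodedata` module exposes the general category of every code point. The function names `classify` and `split_words` are hypothetical and chosen only for this example, as is the decision to treat combining marks (category `M*`) as part of a word. A C program would need the same category data from another source, for example the `UnicodeData.txt` file of the Unicode Character Database or a library such as ICU.

```python
import unicodedata


def classify(ch):
    """Map one character to a coarse class based on its Unicode general category."""
    cat = unicodedata.category(ch)  # two-letter code such as 'Lu', 'Nd', 'Po', 'Zs'
    if cat.startswith("L"):
        return "letter"
    if cat.startswith("M"):
        return "mark"          # combining accents that attach to a preceding letter
    if cat == "Nd":
        return "digit"         # decimal digits in any script
    if cat.startswith("P"):
        return "punctuation"
    if cat.startswith("Z"):
        return "separator"     # spaces, line and paragraph separators
    return "other"


def split_words(text):
    """Collect maximal runs of letters and combining marks (illustrative only)."""
    words, current = [], []
    for ch in text:
        if classify(ch) in ("letter", "mark"):
            current.append(ch)
        elif current:
            # Any non-letter character ends the current run.
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words


if __name__ == "__main__":
    print(split_words("I want, i.e. schön"))
    print(split_words("日本語のテキスト"))
```

The second call shows the caveat from the answer: Japanese text without spaces comes back as a single run, so category data alone cannot find word boundaries in such scripts.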
