How can I relate Unicode blocks to Languages/Scripts?


Question

I am trying to find a resource that connects languages (or, more probably, scripts) to blocks of Unicode characters. Such a resource would answer questions such as "What Unicode blocks are used in French?" or "What languages use the block 0A80-0AFF (http://unicodinator.com/#Block-Gujarati)?" Do you know of such a resource?

I would have expected to find this information easily at unicode.org. I quickly found a great table that relates country codes to languages (http://unicode.org/repos/cldr-tmp/trunk/diff/supplemental/territory_language_information.html). But I have spent quite a bit of time poking around without finding anything that relates Unicode blocks to languages. It's possible that a terminology issue is keeping me from connecting the dots here...

I am not picky about exactly what is meant by "language" in this case (a Java Locale code, an ISO 639 code, or whatever). I also understand that there may not be exact answers because, for instance, an Arabic document can contain Latin and other text in addition to characters from the Arabic blocks (http://unicodinator.com/#Block-Arabic, http://unicodinator.com/#Block-Arabic_Supplement). But surely there must be some table that says "these languages go with these blocks"... I'm also not picky about the format (XML, CSV, whatever); I can easily transform it into data I can use in my application. And again, I do realize the reference would probably connect scripts to blocks, not languages (though scripts can be mapped to languages).

I do realize this will be a many-to-many table, since many languages use characters from multiple blocks and many blocks are used by multiple languages. I also realize the question cannot be answered precisely, since Unicode code points are not language-specific. However, neither can the question "what languages are there in this country?" (the answer is probably "most of them" for most countries), yet a table like this one (http://unicode.org/repos/cldr-tmp/trunk/diff/supplemental/territory_language_information.html) is still possible to create, meaningful, and useful.

As to why I want such a thing: I would like to enhance http://unicodinator.com with global heat maps for the code blocks and with lists of languages; I also have a game concept I am tinkering with. Beyond that, other people could probably find many other uses for this (font creation? heuristic, quick, best-guess language detection now that the Google Translate API is going away? research projects?).

Recommended answer

I got an answer from Unicode.org themselves! In the CLDR subproject, there are documents such as:

  • http://unicode.org/cldr/trac/browser/trunk/common/main/ar.xml
  • http://unicode.org/cldr/trac/browser/trunk/common/main/fr.xml

There is one such file for each language ID, and in each of them you can search for "exemplarCharacters":

<exemplarCharacters>[\u064B \u064C \u064D \u064E \u064F \u0650 \u0651 \u0652 ء آ أ ؤ إ ئ ا ب ت ة ث ج ح خ د ذ ر ز س ش ص ض ط ظ ع غ ف ق ك ل م ن ه و ي ى]</exemplarCharacters>
<exemplarCharacters type="auxiliary">[\u200C\u200D\u200E\u200F]</exemplarCharacters>
<exemplarCharacters type="currencySymbol" draft="contributed">[a b c d e f g h i j k l m n o p q r s t u v w x y z]</exemplarCharacters>
<exemplarCharacters type="index" draft="contributed">[ا ب ت ث ج ح خ د ذ ر ز س ش ص ض ط ظ ع غ ف ق ك ل م ن ه و ي]</exemplarCharacters>
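As a rough illustration of how such entries could be read programmatically, here is a minimal Python sketch. It assumes the file has already been downloaded and that the exemplarCharacters elements sit somewhere under the root element; the sample XML below is a trimmed stand-in, not the real ar.xml. The function names and the "main" label for the untyped element are hypothetical choices made for this example. The decoder only understands \uXXXX escapes, single literal characters, and simple a-z style ranges; other set syntax (for example multi-character strings in braces) is not handled.

```python
import re
import xml.etree.ElementTree as ET

# Trimmed, illustrative stand-in for a CLDR locale file such as ar.xml.
SAMPLE_LDML = r"""
<ldml>
  <characters>
    <exemplarCharacters>[\u064B \u064C ء آ أ ب]</exemplarCharacters>
    <exemplarCharacters type="auxiliary">[\u200C\u200D]</exemplarCharacters>
  </characters>
</ldml>
"""

# One token is either a \uXXXX escape or a single non-whitespace character.
_TOKEN = re.compile(r"\\u([0-9A-Fa-f]{4})|(\S)")


def decode_exemplar_set(text):
    """Return the set of characters named by a simple exemplar set string."""
    body = text.strip()
    if body.startswith("[") and body.endswith("]"):
        body = body[1:-1]
    items = [
        chr(int(m.group(1), 16)) if m.group(1) else m.group(2)
        for m in _TOKEN.finditer(body)
    ]
    chars = set()
    i = 0
    while i < len(items):
        # Expand "x-y" into every character from x to y inclusive.
        if i + 2 < len(items) and items[i + 1] == "-":
            first, last = ord(items[i]), ord(items[i + 2])
            chars.update(chr(cp) for cp in range(first, last + 1))
            i += 3
        else:
            chars.add(items[i])
            i += 1
    return chars


def exemplars_by_type(ldml_text):
    """Map each exemplarCharacters type to its decoded character set."""
    root = ET.fromstring(ldml_text)
    result = {}
    for elem in root.iter("exemplarCharacters"):
        # "main" is this sketch's own label for the element without a type.
        kind = elem.get("type", "main")
        result[kind] = decode_exemplar_set(elem.text or "")
    return result


if __name__ == "__main__":
    for kind, chars in exemplars_by_type(SAMPLE_LDML).items():
        print(kind, sorted(f"U+{ord(c):04X}" for c in chars))
```

Printing code points rather than the characters themselves keeps the output readable for combining marks and invisible characters such as U+200C.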

Alternatively, there is this page: http://unicode.org/repos/cldr-tmp/trunk/diff/by_type/misc.exemplarCharacters.html, which appears to list all of them. I will work on reshuffling this data into a langid -> blockid map of some kind, at which point I will probably award @borrible the "Answer" (rather than make mine the answer).
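The langid -> blockid reshuffling could look roughly like the following Python sketch. It is only an illustration under stated assumptions: the block table is a small hand-copied excerpt of the ranges published in the Unicode Character Database file Blocks.txt (a real tool would load the whole file), the exemplar data is a heavily abbreviated, hypothetical sample rather than actual CLDR content, and all function names are made up for this example.

```python
from bisect import bisect_right
from collections import defaultdict

# Excerpt of block ranges (start, end, name); sorted by start code point.
BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0080, 0x00FF, "Latin-1 Supplement"),
    (0x0100, 0x017F, "Latin Extended-A"),
    (0x0600, 0x06FF, "Arabic"),
    (0x0A80, 0x0AFF, "Gujarati"),
    (0x2000, 0x206F, "General Punctuation"),
]
_STARTS = [start for start, _, _ in BLOCKS]


def block_of(ch):
    """Return the block name for a character, or None if not in the excerpt."""
    cp = ord(ch)
    idx = bisect_right(_STARTS, cp) - 1
    if idx >= 0:
        _, end, name = BLOCKS[idx]
        if cp <= end:
            return name
    return None


def blocks_by_language(exemplars):
    """Build the langid -> set of block names map."""
    result = {}
    for lang, chars in exemplars.items():
        result[lang] = {b for b in map(block_of, chars) if b is not None}
    return result


def languages_by_block(lang_blocks):
    """Invert the map to get the other side of the many-to-many relation."""
    inverse = defaultdict(set)
    for lang, blocks in lang_blocks.items():
        for block in blocks:
            inverse[block].add(lang)
    return dict(inverse)


# Hypothetical, abbreviated exemplar characters per language ID.
EXEMPLARS = {
    "fr": "abc\u00E0\u00E9\u0153",   # a b c, a-grave, e-acute, oe ligature
    "ar": "\u064B\u0627\u0628\u062A",
    "gu": "\u0A85\u0A86\u0A95",
}

lang_blocks = blocks_by_language(EXEMPLARS)
block_langs = languages_by_block(lang_blocks)

if __name__ == "__main__":
    for lang in sorted(lang_blocks):
        print(lang, "->", sorted(lang_blocks[lang]))
    for block in sorted(block_langs):
        print(block, "<-", sorted(block_langs[block]))
```

Keeping both directions as dictionaries of sets matches the many-to-many nature of the data: one lookup answers "which blocks does French use?", the other "which languages use the Gujarati block?".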
