How can I relate Unicode blocks to Languages/Scripts?

Published: 2020/5/3 4:40:01 · Tags: unicode, localization, internationalization


Question

I am trying to find a resource that connects languages (or, more likely, scripts) to blocks of Unicode characters. Such a resource would answer questions such as "Which Unicode blocks are used in French?" or "Which languages use the block 0A80-0AFF (http://unicodinator.com/#Block-Gujarati)?" Do you know of such a resource?

I expected to find this information easily at unicode.org. I quickly found a great table that relates country codes to languages (http://unicode.org/repos/cldr-tmp/trunk/diff/supplemental/territory_language_information.html), but I have spent quite a bit of time poking around without finding anything that relates Unicode blocks to languages. It's possible that a terminology problem is keeping me from connecting the dots here...

I am not picky about exactly what "language" means in this case (a Java Locale code, an ISO 639 code, or whatever). I also understand that there may not be exact answers because, for instance, an Arabic document can contain Latin and other text in addition to characters from the Arabic blocks (http://unicodinator.com/#Block-Arabic, http://unicodinator.com/#Block-Arabic_Supplement). But surely there must be some table that says "these languages go with these blocks"... I'm also not picky about the format (XML, CSV, whatever); I can easily transform it into data my application can use. And again, I do realize that such a reference would probably connect scripts to blocks rather than languages (though scripts can be mapped to languages).

I do realize this will be a many-to-many table, since many languages use characters from multiple blocks and many blocks are used by multiple languages. I also realize the question cannot be answered precisely, since Unicode code points are not language specific. However, neither can the question "what languages are there in this country?" (the answer is probably "most of them" for most countries), yet a table like this one (http://unicode.org/repos/cldr-tmp/trunk/diff/supplemental/territory_language_information.html) is still possible to create, meaningful, and useful.

As to why I want such a thing: I would like to enhance http://unicodinator.com with global heat maps for the code blocks and with lists of languages. I also have a game concept I am tinkering with. Beyond that, other people could probably find many other uses for this (font creation? heuristic, quick, best-guess language detection now that the Google Translate API is going away? research projects?).

Recommended answer

I got an answer from Unicode.org themselves! In the CLDR subproject, there are documents such as:

http://unicode.org/cldr/trac/browser/trunk/common/main/ar.xml
http://unicode.org/cldr/trac/browser/trunk/common/main/fr.xml

There is one such file for each language ID, and in each you can search for "exemplarCharacters":

<exemplarCharacters>[\u064B \u064C \u064D \u064E \u064F \u0650 \u0651 \u0652 ء آ أ ؤ إ ئ ا ب ت ة ث ج ح خ د ذ ر ز س ش ص ض ط ظ ع غ ف ق ك ل م ن ه و ي ى]</exemplarCharacters>
<exemplarCharacters type="auxiliary">[\u200C\u200D\u200E\u200F]</exemplarCharacters>
<exemplarCharacters type="currencySymbol" draft="contributed">[a b c d e f g h i j k l m n o p q r s t u v w x y z]</exemplarCharacters>
<exemplarCharacters type="index" draft="contributed">[ا ب ت ث ج ح خ د ذ ر ز س ش ص ض ط ظ ع غ ف ق ك ل م ن ه و ي]</exemplarCharacters>
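
As a rough illustration of how entries like these could be read programmatically, here is a minimal Python sketch. It assumes the XML has already been fetched as a string; the SAMPLE document, its wrapper elements, and the function names are made up for the example. The set parser is deliberately simplified: it handles only space-separated characters and \uXXXX escapes as in the excerpt above, not the full UnicodeSet syntax (ranges, multi-character strings in braces, and so on) that real CLDR files may use.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, shortened sample in the shape of the excerpt above.
SAMPLE = r"""<ldml><characters>
<exemplarCharacters>[\u064B \u064C ء آ ب]</exemplarCharacters>
<exemplarCharacters type="auxiliary">[\u200C\u200D]</exemplarCharacters>
</characters></ldml>"""


def parse_simple_set(text):
    """Parse a simplified exemplar set into a list of characters.

    Assumption: only single characters and \\uXXXX escapes occur;
    ranges such as [a-z] and {multi-char} strings are NOT handled.
    """
    body = text.strip()
    if body.startswith("[") and body.endswith("]"):
        body = body[1:-1]
    # Turn literal \uXXXX escapes into the characters they denote.
    body = re.sub(
        r"\\u([0-9A-Fa-f]{4})",
        lambda m: chr(int(m.group(1), 16)),
        body,
    )
    # Whitespace only separates entries, so drop it.
    return [ch for ch in body if not ch.isspace()]


def exemplar_sets(xml_text):
    """Map each exemplarCharacters type ("main" if absent) to its characters."""
    root = ET.fromstring(xml_text)
    result = {}
    for element in root.iter("exemplarCharacters"):
        kind = element.get("type", "main")
        result[kind] = parse_simple_set(element.text or "")
    return result


if __name__ == "__main__":
    for kind, chars in exemplar_sets(SAMPLE).items():
        print(kind, [f"U+{ord(c):04X}" for c in chars])
```

For production use, a real UnicodeSet parser (for example the one in ICU) would be a safer choice than this regular-expression shortcut.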

Or, there is this page: http://unicode.org/repos/cldr-tmp/trunk/diff/by_type/misc.exemplarCharacters.html, which appears to list all of them. I will work on reshuffling this data into a langid -> blockid map of some kind, at which point I will probably award @borrible the "Answer" (rather than make mine the answer).
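
The langid -> blockid reshuffling could look something like the Python sketch below. This is only an illustration under stated assumptions: BLOCKS is a small hand-copied excerpt of block ranges (the complete list is in Blocks.txt in the Unicode Character Database), EXEMPLARS contains made-up sample characters rather than real CLDR data, and all function names are hypothetical.

```python
from bisect import bisect_right

# Excerpt of Unicode block ranges (start, end, name), sorted by start.
# Assumption: a real tool would load the full list from Blocks.txt.
BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0080, 0x00FF, "Latin-1 Supplement"),
    (0x0100, 0x017F, "Latin Extended-A"),
    (0x0600, 0x06FF, "Arabic"),
    (0x0A80, 0x0AFF, "Gujarati"),
    (0x2000, 0x206F, "General Punctuation"),
]
STARTS = [start for start, _end, _name in BLOCKS]


def block_of(ch):
    """Return the block name for one character, via binary search."""
    code = ord(ch)
    index = bisect_right(STARTS, code) - 1
    if index >= 0 and code <= BLOCKS[index][1]:
        return BLOCKS[index][2]
    return "Unknown (not in excerpt)"


def blocks_by_language(exemplars):
    """Build the langid -> sorted list of block names map."""
    return {
        lang: sorted({block_of(ch) for ch in chars})
        for lang, chars in exemplars.items()
    }


def languages_by_block(lang_blocks):
    """Invert the map to answer 'which languages use this block?'."""
    inverse = {}
    for lang, blocks in lang_blocks.items():
        for block in blocks:
            inverse.setdefault(block, []).append(lang)
    return inverse


# Hypothetical sample exemplar characters, not real CLDR data.
EXEMPLARS = {"fr": "abcéèœ", "ar": "ابت\u064B", "gu": "અઆ"}

if __name__ == "__main__":
    lang_blocks = blocks_by_language(EXEMPLARS)
    print(lang_blocks)
    print(languages_by_block(lang_blocks))
```

Keeping both directions of the map reflects the many-to-many nature of the relationship described in the question: one lookup answers "which blocks does French use?", the other "which languages use the Gujarati block?".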
