Dynamically generating Ge'ez unicodes


Problem description

Hi. If you look at the image above, you will see a set of very weird-looking characters displayed along with some Latin characters. The weird ones are Eritrean characters. They are the characters we use in my country. So, to go straight to the point, I am hoping to create even the simplest possible bit of software or maybe even a batch file (if possible) to help me make these characters applicable on the web and make PCs understand and display them when being typed. Just like Arabic, Hindi, Chinese... characters are used. I think, since the question of 'creating a language' is often rare or because I may not know the correct term to use, when I searched the internet to find any tutorial or even a freelancer or anything, all I got was... nothing. So, I am hoping, if anyone can give me a step-by-step guide, or even just a clue about how to create this, it would be very helpful.

Thanks.

Solution

Your question asks "how to create a language", so I will describe all the pieces that need to be in place for a new language (or more accurately, writing system). You ask specifically about the Eritrean alphabet, so I will provide specific examples of how that is supported on modern systems, and try to provide you pointers for the pieces you are missing. The answer is long, and provides lots of links, to support the two explanations.

To work with a script like Ge'ez (also known as Ethiopic, the script used to write Amharic in Ethiopia and Tigrinya in Eritrea) you need a few things. The first is a way to encode the characters; a set of numbers representing each character, that the computer can use to represent the text. Luckily, Unicode has become widespread, and Unicode is designed to be a universal character set that includes all of the world's languages. Unicode 3.0 introduced Ethiopic in the range U+1200-U+137F, and later versions added supplements of more obscure characters in the ranges U+1380-U+139F, U+2D80-U+2DDF and U+AB00-U+AB2F. If you wanted to support a language that Unicode didn't yet support, you would either need to use the private use area and define your own mapping of characters to code points, or submit a proposal to have your script added to Unicode; for example, see the proposal for Ethiopic.
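To make the "set of numbers" idea concrete, here is a minimal Python sketch that checks whether a character falls in one of the Ethiopic blocks and prints the code points behind a string. The block boundaries are the ones published in the Unicode block list; the function names `is_ethiopic` and `describe` are my own, purely for illustration.

```python
# Ethiopic blocks as (first, last) code points, inclusive.
ETHIOPIC_BLOCKS = [
    (0x1200, 0x137F),  # Ethiopic
    (0x1380, 0x139F),  # Ethiopic Supplement
    (0x2D80, 0x2DDF),  # Ethiopic Extended
    (0xAB00, 0xAB2F),  # Ethiopic Extended-A
]


def is_ethiopic(ch: str) -> bool:
    """Return True if the single character ch lies inside one of the blocks.

    Note: this is a range check only, so unassigned code points inside a
    block also count as "Ethiopic" here.
    """
    cp = ord(ch)
    return any(first <= cp <= last for first, last in ETHIOPIC_BLOCKS)


def describe(text: str) -> list[str]:
    """Show each character as 'U+XXXX' so the underlying numbers are visible."""
    return [f"U+{ord(ch):04X}" for ch in text]


if __name__ == "__main__":
    word = "\u12a4\u122d\u1275\u122b"  # the word ኤርትራ
    print(describe(word))
    print(all(is_ethiopic(ch) for ch in word))
```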

Now, Unicode is just a character set; an abstract mapping between characters and numbers. To actually transmit these characters as a sequence of bytes, you use a character encoding. There are many encodings; some of them, like ASCII and ISO-8859-1 only cover a subset of the full Unicode character set, while others, like UTF-8 and UTF-16, cover the full range. For documents on the web, UTF-8 is the recommended character encoding; you should never use anything else if you can help it. In UTF-8, you can write Ge'ez directly in the document, for example: ኤርትራ. One thing to watch out for is that some programs (especially on Windows) will offer you "Unicode" as an encoding, when they mean UTF-16; you want to make sure to choose UTF-8, as it's more efficient and more compatible with a wider variety of software.
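As a rough illustration of the difference between a character set and an encoding, the following Python sketch encodes the same text in UTF-8 and UTF-16 and compares the byte counts. The helper names are hypothetical; only the standard `str.encode` method is used.

```python
word = "\u12a4\u122d\u1275\u122b"   # the word ኤርትራ from the example above
fragment = f"<p>{word}</p>"          # the same word wrapped in ASCII markup


def encoded_sizes(text: str) -> dict[str, int]:
    """Return the number of bytes the text occupies in each encoding."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),  # little-endian, no BOM
    }


def ascii_encodable(text: str) -> bool:
    """ASCII covers only a small subset of Unicode, so Ethiopic text fails."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False


if __name__ == "__main__":
    print(encoded_sizes(word))
    print(encoded_sizes(fragment))
    print(ascii_encodable(word))
```

Each Ethiopic character takes three bytes in UTF-8 and two in UTF-16, so pure Ethiopic text is actually a little smaller in UTF-16. Markup, whitespace and Latin text take one byte per character in UTF-8, which is why mixed documents such as web pages usually come out smaller in UTF-8.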

If you are using encodings that don't cover the full range of Unicode, or you don't have a good way to type those characters, and you are writing HTML or XML, you can use numeric character references instead. To do this, you write the Unicode code point of the character you want to refer to between &# and ;. You can write the number in decimal, or in hexadecimal prefixed with an x. For example, ሀ can be written &#x1200; or &#4608; (the semicolon at the end is important; it wasn't working for you in the comments because you were missing it).
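If you want to generate these references rather than write them by hand, a small Python sketch along these lines should work. The function names are my own; `html.unescape` is the standard library call for turning references back into characters.

```python
import html


def to_hex_ncr(text: str) -> str:
    """Replace every non-ASCII character with a hexadecimal reference."""
    return "".join(ch if ord(ch) < 128 else f"&#x{ord(ch):X};" for ch in text)


def to_decimal_ncr(text: str) -> str:
    """Replace every non-ASCII character with a decimal reference."""
    return "".join(ch if ord(ch) < 128 else f"&#{ord(ch)};" for ch in text)


if __name__ == "__main__":
    ha = "\u1200"  # the character ሀ
    print(to_hex_ncr(ha))      # hexadecimal form, note the trailing semicolon
    print(to_decimal_ncr(ha))  # decimal form of the same code point
    # Decoding a reference gives back the original character.
    print(html.unescape("&#x1200;") == ha)
```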

Now that you have a character set, and a way of encoding it, you need a way to display it. Some scripts are easier to display than others. For all scripts, you need a font; a file defining how each character looks. A font contains a collection of glyphs, or drawings of each character. Some scripts, like the Latin alphabet (the alphabet used for English and most European languages) are relatively simple; each character is a separate glyph, and how they are drawn doesn't depend on what characters come before or after (though diacritics and ligatures can make it a little more complicated). Others, like Arabic and Indic scripts, are written in cursive, where letters join to each other so how they are drawn can depend on the characters near them. These languages require special rendering support like Uniscribe or DirectWrite on Windows, Pango on Linux, or advanced font technology like Apple Advanced Typography or Graphite.

Luckily, Ge'ez is a fairly simple writing system, that doesn't require any specialized rendering support or advanced font systems. Each of the characters is a separate glyph, and it doesn't require any reordering. So a normal OpenType font, displayed with the rendering systems already available on most computers, will do the job. But you still need the font in order to be able to display the characters. To create your own font, you can use FontForge (a free/open source tool), Fontographer, FontLab Studio, or other similar software.

For Ethiopic, you don't need to create your own. There are numerous fonts available that include the Ethiopic characters, but one that I would recommend is Abyssinica SIL from SIL (the Summer Institute of Linguistics), which does a lot of great work for minority languages and writing systems. Their fonts are available under a free license, that allows you to use the font, redistribute the font, and modify the font, so their fonts are quite flexible and can be used in a wide variety of situations. Windows ships with Nyala, which includes Ethiopic characters, since Windows Vista, and Ebrima, which added support for Ethiopic characters in Windows 8; so people on Windows Vista or later should be able to view Ethiopic characters already. Mac OS X ships with Kefa as of 10.6.

Once you have the font, you will be able to view Ethiopic characters. But other people reading your documents might not have those fonts (if they are using an older version of Windows or Mac OS X, if they didn't install all of the fonts that came with Windows, or the like), in which case the characters will probably show up as boxes or question marks on their machine. You could give those people a redistributable font like Abyssinica SIL, or they could buy a font that includes Ethiopic characters, but that can be inconvenient. For working with word processor documents or plain text, that's probably the best you can do; they will need the font installed on their computer to be able to display the text. If you create a PDF on your computer, it should embed the fonts that it needs to display the text, so creating a PDF can be a convenient way to include uncommon fonts with your document.

On a web page, you can use web fonts to link to a font from your stylesheet, allowing the user's web browser to load that font for that web page. Web fonts are supported all the way back to IE 6, and in recent versions of most other web browsers, so they are actually quite widely supported. Different web browsers support different font file formats (EOT, TTF, OpenType, SVG, and WOFF), and slightly different syntaxes for the CSS (older versions of IE are based on an older draft), so it can be a bit tricky to make a page that is compatible with all browsers. Luckily, people have automated that process. Some web fonts are available online from Google Web Fonts or FontSquirrel, but sadly, I couldn't find any Ethiopic fonts already hosted. However, you can upload a font to FontSquirrel, and it will convert it into all of the major formats, and provide example CSS that will work on all modern browsers. Note that you should only do this with fonts that allow web embedding; not all fonts do. Since Abyssinica SIL is available under the Open Font License, you can use it, and I've run it through FontSquirrel for you; you can see how it works (check out the Glyphs & Languages tab), or download the kit. To use it, just put the font files (.ttf, .eot, .svg, and .woff) on your server in the same directory as your CSS, and include the following in your CSS:

@font-face {
    font-family: 'abyssinica_silregular';
    src: url('abyssinicasil-r.eot');
    src: url('abyssinicasil-r.eot?#iefix') format('embedded-opentype'),
         url('abyssinicasil-r.woff') format('woff'),
         url('abyssinicasil-r.ttf') format('truetype'),
         url('abyssinicasil-r.svg#abyssinica_silregular') format('svg');
    font-weight: normal;
    font-style: normal;
}

Now that you know how to encode Ethiopic, view Ethiopic characters, and share documents containing Ethiopic characters, you are probably going to want to type them into documents. If you are using HTML, you could just type the numeric character reference described above. In other documents, you could just copy and paste the characters from a chart of all of them, like the Wikipedia page. But that would become pretty cumbersome. Depending on your system and settings, you can also use Unicode Hex Input to enter arbitrary Unicode characters, but that is also cumbersome.

To fully support typing a script on your computer, you need a keyboard layout or input method. Some scripts can be typed with a simple keyboard layout, which says which keys correspond to which characters. If a script has more characters than there are keys on the keyboard, Shift and Alt (or Option on the Mac) can be used to map to more characters. Dead keys can also be used to expand the range of characters that you type; dead keys are sequences of two or more keystrokes that produce a single glyph; for example, on Mac OS X, to type "á", you can type Option-E A. To create a keyboard layout on Windows, you can use the Microsoft Keyboard Layout Creator. Mac OS X uses an XML format for keyboard layouts, so you can create one directly, or use Ukelele from SIL to create one more easily. On systems using X11 (like Linux), you can create your own XKB layouts.

If you need more characters than can be supported with modifiers and dead keys, like typing Chinese or Japanese, then you need a full-fledged input method. An input method allows you to run arbitrary code to map what someone types into the text it produces; for example, in a Japanese input method, you may type a phonetic representation of what you are writing, and it will show you a drop down list of possible characters that match that representation, allowing you to choose the appropriate ones. Windows provides the Input Method Manager for writing input methods, Mac OS X the Input Method Kit, and X11 has a few ways to do it, such as SCIM and iBus.

The standard input method for Ethiopic makes extensive use of dead keys. It looks like the most popular existing input method for Ethiopic is Keyman, which is a commercial input method that works on Mac and Windows, and in addition there's a free variant, KMFL, that works on Linux. SIL has keyboard downloads for this input method; they also have a keyboard layout for Mac OS X which uses dead keys to achieve the same thing. Mac OS X has more extensive dead key support, so it doesn't require an input method to support this form of input, while on Windows you need to use an input method like Keyman to be able to enter input this way. Google has a free input method for Windows, Google Input Tools for Windows, which supports Amharic, and allows you to customize its input schemes; you could try adapting their Amharic support for Tigrinya.

If you just need to support input on a web site, you could do this in JavaScript, by writing an input method in JavaScript that transliterates from what someone types into Ethiopic. I do not know of any existing frameworks for doing this; however, I have found Korean and Japanese input methods implemented in JavaScript. You could take a look at how those are implemented. Upon looking further, I've found that Tavultesoft, who make Keyman, also have KeymanWeb, a JavaScript based input method that you can buy and embed in your site. MediaWiki also has an input method extension Narayam, that includes a JavaScript based input method for MediaWiki based sites like Wikipedia, which includes an experimental Amharic input method. There is also a draft W3C IME API, which helps provide an interface between web apps and native IMEs, as well as JavaScript based IMEs. Given that it's still a draft, I don't know if it is yet supported anywhere.
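To give an idea of what such a transliterating input method involves, here is a minimal sketch of the core mapping. It is written in Python to match the other sketches here, but the same logic would port to JavaScript. It relies on the way the Ethiopic block is laid out: each consonant occupies a run of consecutive code points, one per vowel order. The Latin key assignments below (for instance capital "E" for the fifth order) are an assumption of mine for illustration, not an established typing convention, and only a handful of consonants are included.

```python
# First code point of each consonant series (a small, incomplete subset).
CONSONANT_BASE = {
    "h": 0x1200, "l": 0x1208, "m": 0x1218, "r": 0x1228,
    "s": 0x1230, "b": 0x1260, "t": 0x1270, "n": 0x1290,
}
GLOTTAL_BASE = 0x12A0  # series used when a vowel stands on its own

# Hypothetical key-to-order mapping: offset of each vowel within a series.
VOWEL_ORDER = {"e": 0, "u": 1, "i": 2, "a": 3, "E": 4, "o": 6}
BARE_CONSONANT = 5     # sixth order: consonant with no following vowel


def transliterate(latin: str) -> str:
    """Convert Latin keystrokes to Ethiopic syllables, left to right."""
    out = []
    i = 0
    while i < len(latin):
        ch = latin[i]
        if ch in CONSONANT_BASE:
            base = CONSONANT_BASE[ch]
            nxt = latin[i + 1] if i + 1 < len(latin) else ""
            if nxt in VOWEL_ORDER:
                # Consonant plus vowel: pick the matching syllable.
                out.append(chr(base + VOWEL_ORDER[nxt]))
                i += 2
            else:
                # Consonant alone: use the sixth-order form.
                out.append(chr(base + BARE_CONSONANT))
                i += 1
        elif ch in VOWEL_ORDER:
            # Vowel with no preceding consonant: glottal series.
            out.append(chr(GLOTTAL_BASE + VOWEL_ORDER[ch]))
            i += 1
        else:
            # Anything unmapped (spaces, punctuation) passes through.
            out.append(ch)
            i += 1
    return "".join(out)


if __name__ == "__main__":
    print(transliterate("Ertra"))
    print(transliterate("selam"))
```

A real input method would need the full consonant inventory, multi-letter keys, the labialized forms and Ethiopic punctuation, and it would have to update the text as each key is pressed rather than convert a finished string.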

With all the above (a character set, encoding, fonts, rendering support, and an input method), you will be able to create, share, and view documents in your script. If that's all you need, great; the above will allow you to work with documents in a given script. But for full support for a language on your computer, not just its script or writing system, there are two more pieces that you need: a locale, and your software to be localized (translated and adapted) for your language.

A locale specifies how programs should manipulate text in a given script, language, culture, and/or encoding. There are many common text processing operations that programs do: displaying numbers, displaying dates and times, sorting strings or names, and so on. How these should work can differ based on the language, script, and culture of the person using the program; for instance, in Swedish "ü" is sorted along with "y", while in English and German it's sorted along with "u". Differences may not be based on language: both Mexico and Spain use Spanish, but in Mexico numbers are displayed with . as the decimal separator (1½ is written "1.5"), while in Spain , is used as the decimal separator (1½ is written "1,5"). A locale specifies all of these rules. Because the locale can vary based on language, culture, and sometimes other factors, the language and country are usually used to specify the locale, and other information can be used as well.
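The decimal separator example can be sketched in a few lines of Python. The lookup table here is a hypothetical stand-in holding just the two rules mentioned above; a real program would take this data from the operating system's locale support or from a library such as ICU rather than hard-coding it.

```python
# Hypothetical, hand-written table with only the two rules from the text.
DECIMAL_SEPARATOR = {
    "es-MX": ".",  # Spanish as used in Mexico
    "es-ES": ",",  # Spanish as used in Spain
}


def format_decimal(value: float, locale_tag: str, digits: int = 1) -> str:
    """Format value with a fixed number of digits and the locale's separator."""
    text = f"{value:.{digits}f}"  # always uses "." at this point
    return text.replace(".", DECIMAL_SEPARATOR[locale_tag])


if __name__ == "__main__":
    print(format_decimal(1.5, "es-MX"))
    print(format_decimal(1.5, "es-ES"))
```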

The most widely used standard for naming locales is RFC 4646 (BCP 47). Locales are usually specified as "ln-CC" with the language code ln and country code CC: US English is en-US, British English is en-GB, and French in France is fr-FR. If more information needs to be specified, it can be included. For instance, Serbian can be written with either Latin or Cyrillic, and so Serbian in Serbia can be either sr-Latn-RS or sr-Cyrl-RS. Tigrinya in Eritrea is written ti-ER.
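As a rough sketch of how such a tag breaks down, the Python below splits a tag into language, optional script and optional region. It handles only that simplified subset; the full BCP 47 grammar also allows variants, extensions and private-use subtags, which this ignores. The names `LocaleTag` and `parse_tag` are mine.

```python
from typing import NamedTuple, Optional


class LocaleTag(NamedTuple):
    language: str
    script: Optional[str]
    region: Optional[str]


def parse_tag(tag: str) -> LocaleTag:
    """Split a simple language[-Script][-REGION] tag into its parts."""
    parts = tag.split("-")
    language = parts[0].lower()
    script = None
    region = None
    for part in parts[1:]:
        if script is None and region is None and len(part) == 4 and part.isalpha():
            # Four letters directly after the language: a script code.
            script = part.title()
        elif region is None and (
            (len(part) == 2 and part.isalpha()) or (len(part) == 3 and part.isdigit())
        ):
            # Two letters or three digits: a region code.
            region = part.upper()
    return LocaleTag(language, script, region)


if __name__ == "__main__":
    print(parse_tag("ti-ER"))
    print(parse_tag("sr-Latn-RS"))
```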

There are a variety of different formats for defining the rules that a particular locale has. Windows uses NLP files, a custom format that can be created with Microsoft Locale Builder. POSIX (Unix/Linux) locales can be created using localedef. Many systems these days are moving towards the Unicode Common Locale Data Repository (CLDR), which specifies a standardized format for locale data as well as a comprehensive database of locales for many of the world's languages. ICU is a library for C and Java (and used by many other environments) for manipulating Unicode text according to Unicode rules and locale data; they have a good browser for the data from the CLDR and their own locale data. For example, take a look at their entry for ti-ER.

Finally, for full support of a language, you need to translate the software itself into that language. There are, of course, many pieces of software, and each one contains many strings that need to be translated. Some software is not designed to be translated; it has not been internationalized. Some software can only be translated by whoever created it; the strings are built into the program and cannot be easily modified by a third party. But it is possible to localize some software, translating it to your language and culture. If the software has already been localized for several other languages and cultures, it is likely to be flexible enough to support a new language, and if it uses formats that are easily modifiable for localization information, it can be modified by third parties.

For instance, applications on Mac OS X store their localization data in separate files within the application bundle. There is a tool called AppleGlot (you need to register for the Mac Developer Program and go to the downloads area to find it) which can help you extract that data, provide a file with all of the strings which need to be translated, and allow you to combine the translations with the application again once you have translated them. For open source software, such as much of the software available on Linux, you can work with the developers to provide translation. Some software uses gettext for translation strings, which uses the PO file format that you can edit using poedit. Some software uses Qt, for which you can use Qt Linguist. Or for dealing with a wide variety of formats, you can use a commercial offering like Swordfish or Transifex.
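To show what gettext-style lookup looks like from the program's side, here is a small sketch using Python's standard `gettext` module. The domain name `demo_app` and the `locale` directory are assumptions for illustration; no Tigrinya catalog is expected to exist there, so with `fallback=True` the untranslated string simply comes back unchanged.

```python
import gettext


def load_translator(language: str, domain: str = "demo_app",
                    localedir: str = "locale"):
    """Load the compiled catalog for a language, if one is installed.

    With fallback=True a missing catalog yields a translator that
    returns every message unchanged instead of raising an error.
    """
    return gettext.translation(
        domain, localedir=localedir, languages=[language], fallback=True
    )


if __name__ == "__main__":
    _ = load_translator("ti").gettext
    # Strings wrapped in _() are the ones a translator sees in the PO file.
    print(_("Hello"))
```

Once a translated PO file has been compiled to a .mo file and placed under the locale directory (in the `ti/LC_MESSAGES/` subdirectory), the same call would return the translated string without any change to the program.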

Of course, no one person can do all of the above; it takes many people working together to build support for a new language on modern computer systems. This is all intended to be a high-level tour of all of the components that go into language support for a given language, with references that will help you follow up on whichever aspects you would like to work on, as well as demonstrate what already works for Tigrinya and the Ge'ez script.
