How to define/declare utf-8 code points for Turkish special chars (non-ascii) to use them as standard utf-8 encoding?


Problem description


Turkish chars 'ÇçĞğİıÖöŞşÜü' are not handled correctly in utf-8 encoding although they all seem to be defined. The charcode of every one of them is 65533 (the replacement character, possibly for error display) in usage, and a question mark or box is displayed depending on the selected font. In some cases 0/null is returned as the charcode. On the internet there are lots of tools which give utf-8 definitions of them, but I am not sure whether those tools use any defined (real/international) registry or dynamically create the definition with known rules and calculations. Fonts for them are well defined and there is no problem displaying them when we enter the code points manually. This proves that they are defined in utf-8. But on the other hand they are not handled in encodings or transformations such as ajax requests/responses.

So the base question is "HOW CAN WE DEFINE A CODE POINT FOR A CHAR"? The question may be tailored as follows to prevent misconception. Suppose we have prepared the encoding data for "Ç" like this ->
Character : Ç
Character name : LATIN CAPITAL LETTER C WITH CEDILLA
Hex code point : 00C7
Decimal code point : 199
Hex UTF-8 bytes : C387
......
Where/how can we save this info so that it becomes a standard utf-8 char?
How can we distribute/expose it (make it ready to be used by others)?
Do we need confirmation from anybody/any foundation (like the unicode/utf-8 consortium)?
How can we detect/fix errors if the chars are already registered but not working correctly?
Can we have a custom-utf8 configuration? If yes, how?

Note: No code snippet is needed here, as this is not a misuse problem.

Solution

The characters you mention are present in Unicode. Here are their character codes in hexadecimal and how they are encoded in UTF-8:

      Ç     ç     Ğ     ğ     İ     ı     Ö     ö     Ş     ş     Ü     ü
Code: 00c7  00e7  011e  011f  0130  0131  00d6  00f6  015e  015f  00dc  00fc
UTF8: c3 87 c3 a7 c4 9e c4 9f c4 b0 c4 b1 c3 96 c3 b6 c5 9e c5 9f c3 9c c3 bc

This means that if you write for example the bytes 0xc4 0x9e into a file you have written the character Ğ, and any software tool that understands UTF-8 must read it back as Ğ.
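As a rough illustration of the round trip described above, the following Java sketch encodes Ğ to UTF-8, decodes the bytes c4 9e back, and then decodes a single 0xd0 byte (Ğ in ISO-8859-9) as if it were UTF-8. The class and method names are made up for this example. That a wrong charset somewhere in the pipeline is what causes the 65533 in the question is an assumption, not something the question confirms.

```java
import java.nio.charset.StandardCharsets;

final class Utf8RoundTrip {
    // Formats bytes as lowercase hex pairs separated by spaces, e.g. "c4 9e".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    // Encodes a string as UTF-8 and returns the bytes in hex.
    static String encodeHex(String s) {
        return hex(s.getBytes(StandardCharsets.UTF_8));
    }

    // Decodes bytes as UTF-8; malformed input becomes U+FFFD (decimal 65533).
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String gBreve = "\u011e"; // Ğ, written as an escape so the source file encoding does not matter
        System.out.println(encodeHex(gBreve)); // c4 9e

        byte[] utf8 = {(byte) 0xc4, (byte) 0x9e};
        System.out.println(decodeUtf8(utf8).equals(gBreve)); // true

        // 0xd0 is Ğ in ISO-8859-9, but on its own it is not valid UTF-8.
        byte[] latin5 = {(byte) 0xd0};
        System.out.println((int) decodeUtf8(latin5).charAt(0)); // 65533
    }
}
```

If the last case matches what you see, the characters themselves are fine and the thing to check is which charset each side of the request/response actually uses.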

Update: For correct alphabetic order and case conversions in Turkish you have to use a library that understands locales, just like for any other natural language. For example in Java:

import java.util.Locale;

Locale tr = new Locale("tr", "TR");                     // Turkish locale: language first, then country
System.out.println("ÇçĞğİıÖöŞşÜü".toUpperCase(tr));    // ÇÇĞĞİIÖÖŞŞÜÜ
System.out.println("ÇçĞğİıÖöŞşÜü".toLowerCase(tr));    // ççğğiıööşşüü

Notice how ı in uppercase becomes I and İ in lowercase becomes i; conversely, in the Turkish locale i in uppercase becomes İ and I in lowercase becomes ı. You don't say which programming language you use, but its standard library surely supports locales too.
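The dotted/dotless i rule is easiest to see on plain ASCII input. The sketch below (hypothetical class and method names) compares the Turkish locale with Locale.ROOT, which is the usual choice for locale-independent strings such as protocol keywords or identifiers.

```java
import java.util.Locale;

final class TurkishCase {
    // Locale.forLanguageTag is used here instead of the Locale constructor,
    // which is deprecated in recent JDKs.
    static final Locale TURKISH = Locale.forLanguageTag("tr-TR");

    static String upperTr(String s) { return s.toUpperCase(TURKISH); }

    static String lowerTr(String s) { return s.toLowerCase(TURKISH); }

    // Locale-independent lowercasing for non-linguistic text.
    static String lowerRoot(String s) { return s.toLowerCase(Locale.ROOT); }

    public static void main(String[] args) {
        System.out.println(upperTr("i"));       // İ (U+0130)
        System.out.println(lowerTr("I"));       // ı (U+0131)
        System.out.println(lowerTr("TITLE"));   // tıtle
        System.out.println(lowerRoot("TITLE")); // title
    }
}
```

Calling toLowerCase() or toUpperCase() without a locale argument uses the JVM default locale, so the same program can behave differently on a machine configured for Turkish.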

Unicode defines the code points and certain properties for each character (for example, whether it is a digit or a letter and, for a letter, whether it is uppercase, lowercase, or titlecase), and certain generic algorithms for dealing with Unicode text (e.g. how to mix right-to-left text and left-to-right text). Alphabetic order and correct case conversion are defined by national standardization bodies, such as the Institute for the Languages of Finland in Finland or the Real Academia Española in Spain, independent of Unicode.
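Those per-character properties are already shipped with most platforms, so nothing has to be registered by hand. As a small sketch, Java's Character class exposes the name and letter case of each code point; the describe helper and its output format are invented for this example.

```java
final class UnicodeProperties {
    // Builds a one-line description from the Unicode data bundled with the JDK.
    static String describe(int codePoint) {
        String category = Character.isUpperCase(codePoint) ? "uppercase letter"
                : Character.isLowerCase(codePoint) ? "lowercase letter"
                : Character.isLetter(codePoint) ? "other letter"
                : "not a letter";
        return String.format("U+%04X %s (%s)",
                codePoint, Character.getName(codePoint), category);
    }

    public static void main(String[] args) {
        // The twelve Turkish letters from the table, written as escapes.
        String letters = "\u00c7\u00e7\u011e\u011f\u0130\u0131\u00d6\u00f6\u015e\u015f\u00dc\u00fc";
        letters.codePoints().forEach(cp -> System.out.println(describe(cp)));
        // First line: U+00C7 LATIN CAPITAL LETTER C WITH CEDILLA (uppercase letter)
    }
}
```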

Update 2:

The test ((ch&0x20)==ch) for lower case is broken for most languages in the world, not just Turkish. So is the algorithm for converting upper case to lower case you mention. Also, the test for being a letter is incorrect: in many languages Z is not the last letter of the alphabet. To work with text correctly you must use library functions that have been written by people who know what they are doing.
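To make the failure concrete, the sketch below puts an ASCII-style bit test next to the library call. The form (ch & 0x20) != 0 is an assumption about what such a test is meant to do (in ASCII, bit 0x20 separates 'a' from 'A'); it is not taken from the question, and the class name is hypothetical.

```java
final class LowerCaseTest {
    // ASCII-only heuristic: assumes bit 0x20 marks lowercase, which holds for A-Z/a-z only.
    static boolean asciiBitTest(char ch) {
        return (ch & 0x20) != 0;
    }

    // Library test backed by the Unicode character database.
    static boolean unicodeTest(char ch) {
        return Character.isLowerCase(ch);
    }

    public static void main(String[] args) {
        // ğ (U+011F), ş (U+015F) and İ (U+0130)
        for (char ch : "\u011f\u015f\u0130".toCharArray()) {
            System.out.println(ch + " bit test: " + asciiBitTest(ch)
                    + ", Character.isLowerCase: " + unicodeTest(ch));
        }
        // ğ and ş are lowercase but fail the bit test; İ is uppercase but passes it.
    }
}
```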

Unicode is supposed to be universal. Creating national and language-specific variants of encodings is what led us to the mess that Unicode is trying to solve. Unfortunately there is no universal standard for ordering characters. For example, in English a = ä < z, but in Swedish a < z < ä. In German Ü is equivalent to U by one standard, and to UE by another. In Finnish Ü = Y. There is no way to order code points so that the ordering would be correct in every language.
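In Java, language-specific ordering is available through java.text.Collator. The sketch below (hypothetical class and method names) sorts the same three strings by code point, with an English collator, and with a Swedish one. The Swedish result depends on the collation data shipped with the runtime, so no output is claimed for it here.

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

final class LocaleSort {
    // Sorts by String.compareTo, i.e. by UTF-16 code unit values.
    static String sortedByCodePoint(String... words) {
        String[] copy = words.clone();
        Arrays.sort(copy);
        return String.join(" ", copy);
    }

    // Sorts with the collation rules the runtime provides for the given locale.
    static String sortedFor(Locale locale, String... words) {
        String[] copy = words.clone();
        Arrays.sort(copy, Collator.getInstance(locale));
        return String.join(" ", copy);
    }

    public static void main(String[] args) {
        String[] words = {"z", "\u00e4", "a"}; // z, ä, a
        System.out.println(sortedByCodePoint(words));         // a z ä
        System.out.println(sortedFor(Locale.ENGLISH, words)); // a ä z
        // Swedish rules place ä after z, if the runtime ships Swedish collation data.
        System.out.println(sortedFor(Locale.forLanguageTag("sv-SE"), words));
    }
}
```

The code point order happens to match the Swedish order for this input, which is a coincidence of where ä sits in Unicode rather than a property of the encoding.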
