UTF-8 and ISO 8859-9


Question

I have been reading about UTF-8 and Unicode for the last couple of days, and just when I thought I had figured it all out, I got confused when I read that UTF-8 and ISO 8859-9 are not compatible.

I have a database that stores data as UTF-8. A customer requires support for various ISO 8859-x code pages (e.g. 8859-2 and 8859-3, and also ISO 6937). My questions are:

  1. Since my data ingest and database engine type is UTF-8, would it be correct to assume that I am using Unicode?

  2. I understand that Unicode can support all characters and that it is the way to go. However, my customer is a European entity that wants us to use ISO code pages, so my question is: how can I support multiple client use cases using existing UTF-8 data? Since ISO 8859-x is not a subset of Unicode, do I have to write code to send the appropriate ISO 8859-x character set depending on my use case? Is that all I need to do, or is there more to it?

btw, my understanding is that UTF-8 is merely an encoding algorithm to get a numeric value from binary data. If so, how is the character set applied? Do I have to write code to return an 8859-x response, or is it enough to set an appropriate character set value in the response header?

Solution

The topic is pretty vast, so let me simplify (a lot, perhaps too much) and answer point by point.

Since my data ingest and database engine type is UTF-8, would it be correct to assume that I am using Unicode?

Yes, you're using Unicode: you're storing Unicode characters (formally called code points) using the UTF-8 encoding. Please note that Unicode defines a set of characters and the rules for handling them (even if the same word is often used loosely as a synonym for the UTF-16 encoding); the way you encode those characters in a byte stream is a separate matter.

... However, my customer is a European entity that wants us to use ISO code pages, so my question is: how can I support multiple client use cases using existing UTF-8 data?

Of course. If you store Unicode characters (it doesn't matter in which encoding), you can convert them to a specific ISO 8859 code page (or to any other encoding). OK, this isn't strictly always true: a code page holds only a small subset of Unicode, so a character that the target code page lacks cannot be converted and has to be rejected or substituted. For text written in the languages a code page was designed for, this rarely matters.
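
As a minimal sketch of that caveat, assuming Python and its built-in codecs (the helper name `to_code_page` and the sample text are made up for illustration): encoding fails on a character the target code page lacks unless a substitution policy is chosen.

```python
def to_code_page(text: str, codec: str, errors: str = "strict") -> bytes:
    """Encode Unicode text into the given code page.

    errors="strict" raises UnicodeEncodeError on unmappable characters;
    errors="replace" substitutes "?" for them instead.
    """
    return text.encode(codec, errors=errors)


# Illustrative sample: the euro sign does not exist in ISO 8859-9.
sample = "Şişli €"

try:
    to_code_page(sample, "iso-8859-9")
except UnicodeEncodeError as exc:
    print("cannot convert:", exc.reason)

# Lossy fallback: the euro sign becomes "?".
lossy = to_code_page(sample, "iso-8859-9", errors="replace")
print(lossy)
```

Whether to reject the text or substitute a placeholder is a policy decision that probably needs to be agreed with each customer.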

... Since ISO 8859-x is not a subset of Unicode, do I have to write code to send the appropriate ISO 8859-x character set depending on my use case?

All characters from the ISO 8859 code pages are also available in Unicode, so (from this point of view) each code page is a subset. Of course the encoded values are different, so they need to be converted. If you know which code page each customer needs, you can always convert UTF-8 encoded Unicode text into text in that code page.
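
A rough sketch of such a conversion, assuming Python: the customer identifiers and the `CUSTOMER_CODECS` mapping are hypothetical, and ISO 6937 is left out because, as far as I know, Python's standard library ships no codec for it.

```python
# Hypothetical mapping from customer id to the code page that customer asked for.
CUSTOMER_CODECS = {
    "customer_a": "iso-8859-2",
    "customer_b": "iso-8859-3",
    "customer_c": "iso-8859-9",
}


def transcode_for_customer(utf8_bytes: bytes, customer: str) -> bytes:
    """Decode stored UTF-8 bytes to Unicode text, then re-encode for the customer."""
    text = utf8_bytes.decode("utf-8")
    return text.encode(CUSTOMER_CODECS[customer])


stored = "ğ".encode("utf-8")       # what the UTF-8 database holds
converted = transcode_for_customer(stored, "customer_c")

print(stored)     # b'\xc4\x9f' (two bytes in UTF-8)
print(converted)  # b'\xf0'     (one byte in ISO 8859-9)
```

The same character is stored as different bytes in each encoding, which is why the data has to be converted rather than merely relabeled.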

Is that all I need to do, or is there more to it?

Just that. The code can be pretty short, but you didn't tag your question with any language, so I won't provide language-specific links or examples. For a rudimentary example, take a look at this post.

Let me also say one important thing: if they want to consume your data in their own code page, then you have to perform a conversion. If they can consume UTF-8 data directly (or you present it to them somehow in your own application), then you don't have to worry about code pages (that's why we're using Unicode) because, no matter the encoding, the Unicode character set contains all the characters they may need.
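
On the response-header part of the question: the charset label only tells the client how to decode the bytes; it does not convert anything, so the body has to be re-encoded as well. A minimal sketch, assuming Python and a plain-text HTTP response (the function name `build_response` is made up, and no particular web framework is implied):

```python
def build_response(text: str, codec: str) -> tuple[dict[str, str], bytes]:
    """Return (headers, body) where the body bytes match the declared charset."""
    body = text.encode(codec)  # the actual conversion happens here
    headers = {
        # The label must describe the bytes that are really sent.
        "Content-Type": f"text/plain; charset={codec}",
        "Content-Length": str(len(body)),
    }
    return headers, body


headers, body = build_response("ğ", "iso-8859-9")
print(headers["Content-Type"])  # text/plain; charset=iso-8859-9
print(body)                     # b'\xf0'
```

Sending the stored UTF-8 bytes unchanged with an ISO 8859-9 label would make the client display the wrong characters.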

btw, my understanding is that UTF-8 is merely an encoding algorithm to get a numeric value from binary data.

Not exactly. You have a table of characters, right? For example A. Now you have to store a numeric value that will be interpreted as A. In ASCII it was arbitrarily decided that 65 is the numeric value that represents that character. Unicode is a long list of characters, each with its own number (plus rules to combine them); the UTF-x encodings are different ways of storing those numbers as bytes.
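
To make the distinction concrete, here is a small Python sketch (the choice of "ğ" as the sample character is arbitrary): the number assigned to a character is fixed by Unicode, while the bytes depend on the encoding.

```python
# "A" has the number 65, and ASCII, UTF-8 and ISO 8859-9 all store it as one byte.
a_number = ord("A")
a_bytes = {codec: "A".encode(codec) for codec in ("ascii", "utf-8", "iso-8859-9")}
print(a_number, a_bytes)

# "ğ" has the number 287 (U+011F); each encoding stores it differently.
g_number = ord("ğ")
g_bytes = {codec: "ğ".encode(codec) for codec in ("utf-8", "utf-16-le", "iso-8859-9")}
print(g_number, g_bytes)
```

UTF-8 and the ISO 8859 code pages agree byte for byte on the ASCII range; they diverge only for characters outside it.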

If so, how is the character set applied?

"Character set" is a pretty vague term. By "Unicode character set" you mean all the characters available in Unicode. If you mean a code page, then (simplifying) it represents a subset of the available character set. Imagine you have an 8-bit extension of ASCII (so up to 256 symbols): you can't accommodate all the characters used in Europe, right? Code pages solve this problem: half of these symbols (the 7-bit ASCII range) are always the same, and the other half represent different characters depending on the code page (each "country" uses a specific code page with its preferred characters).
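
A short Python illustration of that split (the byte value 0xF0 is just one convenient example): a byte from the upper half means a different character in each code page, while a byte from the lower half means the same thing everywhere.

```python
CODE_PAGES = ("iso-8859-1", "iso-8859-2", "iso-8859-9")

# Upper half: the same byte decodes to a different character per code page.
upper = {cp: bytes([0xF0]).decode(cp) for cp in CODE_PAGES}
print(upper)  # {'iso-8859-1': 'ð', 'iso-8859-2': 'đ', 'iso-8859-9': 'ğ'}

# Lower half (ASCII range): byte 0x41 is "A" in every one of them.
lower = {cp: bytes([0x41]).decode(cp) for cp in CODE_PAGES}
print(lower)  # {'iso-8859-1': 'A', 'iso-8859-2': 'A', 'iso-8859-9': 'A'}
```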

For an introductory overview of this topic, see "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets".
