What is the exact difference between Windows-1252(1/3/4) and ISO-8859-1?


Question



We are hosting PHP apps on a Debian-based LAMP installation. Everything is quite OK, performance-, administration- and management-wise. However, being somewhat new devs (we're still in high school), we've run into some problems with character encoding for Western charsets.

After doing a lot of research I have come to the conclusion that the information online is somewhat confusing. It talks about Windows-1252 being "ANSI" and totally ISO-8859-1 compatible.

So anyway, What is the difference between Windows-1252(1/3/4) and ISO-8859-1? And where does ANSI come into this anyway?

What encoding should we use on our Debian servers (and workstations) in order to ensure that clients get all information in the intended way and that we don't lose any chars on the way?

Solution

I'd like to answer this in a more web-oriented manner, and to answer it we need a little history. Joel Spolsky has written a very good introductory article on the absolute minimum every dev should know about Unicode and character encodings. Bear with me here, because this is going to be somewhat of a looong answer. :)

As a history I'll point to some quotes from there: (Thank you very much Joel! :) )

The only characters that mattered were good old unaccented English letters, and we had a code for them called ASCII which was able to represent every character using a number between 32 and 127. Space was 32, the letter "A" was 65, etc. This could conveniently be stored in 7 bits. Most computers in those days were using 8-bit bytes, so not only could you store every possible ASCII character, but you had a whole bit to spare, which, if you were wicked, you could use for your own devious purposes.

And all was good, assuming you were an English speaker. Because bytes have room for up to eight bits, lots of people got to thinking, "gosh, we can use the codes 128-255 for our own purposes." The trouble was, lots of people had this idea at the same time, and they had their own ideas of what should go where in the space from 128 to 255.

So now "OEM character sets" were distributed with PCs and these were still all different and incompatible. And to our contemporary amazement - it was all fine! They didn't have the Internet back then and people rarely exchanged files between systems with different locales.

Joel goes on saying:

In fact as soon as people started buying PCs outside of America all kinds of different OEM character sets were dreamed up, which all used the top 128 characters for their own purposes. Eventually this OEM free-for-all got codified in the ANSI standard. In the ANSI standard, everybody agreed on what to do below 128, which was pretty much the same as ASCII, but there were lots of different ways to handle the characters from 128 and on up, depending on where you lived. These different systems were called code pages.

And this is how the "Windows code pages" were born, eventually. They were actually "parented" by the DOS code pages. And then Unicode was born! :) UTF-8 is "another system for storing your string of Unicode code points", in which "every code point from 0-127 is stored in a single byte", i.e. the same as ASCII. I will not go into any more specifics of Unicode and UTF-8, but you should read up on the BOM, endianness and character encoding in general.
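That ASCII compatibility of UTF-8 can be verified directly; a minimal sketch in Python:

```python
# UTF-8 stores code points 0-127 in a single byte, identical to ASCII.
assert "A".encode("utf-8") == "A".encode("ascii") == b"A"
assert len("Hello".encode("utf-8")) == 5  # one byte per ASCII character

# Code points above 127 take two or more bytes: U+00E9 (e with acute)
# becomes the two-byte sequence C3 A9.
assert "é".encode("utf-8") == b"\xc3\xa9"
```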

@Jukka K. Korpela is "right on the money" saying that most probably you are referring to Windows-1252.

On "the ANSI conspiracy", Microsoft actually admits the mislabeling in a glossary of terms:

The so-called Windows character set (WinLatin1, or Windows code page 1252, to be exact) uses some of those positions for printable characters. Thus, the Windows character set is NOT identical with ISO 8859-1. The Windows character set is often called "ANSI character set", but this is SERIOUSLY MISLEADING. It has NOT been approved by ANSI.

So "ANSI", when referring to Windows character sets, is not ANSI-certified! :)

As Jukka pointed out (credit goes to you for the nice answer):

Windows-1252 differs from ISO Latin 1, also known as ISO-8859-1, as a character encoding: the code range 0x80 to 0x9F is reserved for control characters in ISO-8859-1 (the so-called C1 controls), whereas in Windows-1252 some of those codes are assigned to printable characters (mostly punctuation characters) and others are left undefined.
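The difference Jukka describes is easy to demonstrate: decode the same byte under both encodings. A small Python sketch (byte values chosen for illustration):

```python
# 0x93 is a C1 control character in ISO-8859-1 (latin-1), but the
# left double quotation mark (U+201C) in Windows-1252 (cp1252).
assert b"\x93".decode("latin-1") == "\x93"    # invisible control char
assert b"\x93".decode("cp1252") == "\u201c"   # printable punctuation

# A few positions (0x81, 0x8D, 0x8F, 0x90, 0x9D) are undefined in
# Windows-1252, so decoding them fails outright.
try:
    b"\x81".decode("cp1252")
except UnicodeDecodeError:
    print("0x81 is undefined in Windows-1252")
```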

However, my personal opinion and technical understanding is that both Windows-1252 and ISO-8859-1 ARE NOT WEB ENCODINGS! :) So:

  • For web pages, please use UTF-8 as the encoding for the content. So store data as UTF-8 and "spit it out" with the HTTP header: Content-Type: text/html; charset=utf-8.

    There is also a thing called the HTML content-type meta tag:

        <html>
        <head>
        <meta http-equiv="Content-Type" content="text/html; charset=utf-8">

    Now, what browsers actually do when they encounter this tag is start over from the beginning of the HTML document, so that they can reinterpret the document in the declared encoding. This should happen only if there is no 'Content-Type' header.

  • Use other specific encodings if the users of your system need files generated from it. For example, some Western users may need Excel-generated files, or CSVs in Windows-1252. If this is the case, encode the text in that locale, then store it on the fs and serve it as a downloadable file.

  • There is another thing to be aware of in the design of HTTP: the content-encoding negotiation mechanism should work like this.

    I. The client requests a web page in specific content types and encodings via the 'Accept' and 'Accept-Charset' request headers.

    II. Then the server (or web application) returns the content trans-coded to that encoding and character set.
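The second bullet above (locale-specific downloads such as Windows-1252 CSVs for Excel users) could be sketched like this in Python; the file name and row data are made up for illustration:

```python
import csv
import io

rows = [["Name", "Price"], ["Café au lait", "€2.50"]]

# Build the CSV in memory, then encode it as Windows-1252 before
# storing it on the fs / serving it as a download.
buf = io.StringIO()
csv.writer(buf, delimiter=";").writerows(rows)
data = buf.getvalue().encode("cp1252")

# The euro sign sits at 0x80 in Windows-1252 -- a printable character
# that plain ISO-8859-1 cannot represent at all.
assert b"\x80" in data

with open("export.csv", "wb") as f:  # serve this file as the download
    f.write(data)
```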

This is NOT THE CASE in most modern web apps. What actually happens is that web applications serve (force on the client) content as UTF-8. And this works because browsers interpret received documents based on the response headers and not on what they actually expected.
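That header-driven behaviour can be shown with a toy round trip (the charset extraction below is a deliberate simplification of real header parsing):

```python
# Server side: store/emit the body as UTF-8 and declare it in the header.
body = "<p>naïve café</p>".encode("utf-8")
header = "Content-Type: text/html; charset=utf-8"

# Client side: a browser decodes using the charset declared in the
# response header, not whatever encoding it might have "expected".
charset = header.split("charset=")[1]
assert body.decode(charset) == "<p>naïve café</p>"
```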

We should all go Unicode, so please, please, please use UTF-8 to distribute your content wherever possible and, most of all, wherever applicable. Or else the elders of the Internet will haunt you! :)

P.S. Some more nice articles on using MS Windows characters in Web Pages can be found here and here.
