ISO-8859-1 encoding and binary data preservation


Problem description


I read in a comment to an answer by @Esailija to a question of mine that:

"ISO-8859-1 is the only encoding to fully retain the original binary data, with exact byte<->codepoint matches"

I also read in this answer by @AaronDigulla that:

"In Java, ISO-8859-1 (a.k.a. ISO-Latin1) is a 1:1 mapping"

I need some insight on this. This will fail (as illustrated here):

// \u00F6 is ö
System.out.println(Arrays.toString("\u00F6".getBytes("utf-8")));
// prints [-61, -74]
System.out.println(Arrays.toString("\u00F6".getBytes("ISO-8859-1")));
// prints [-10]

Questions

  1. I admit I do not quite get it: why does it not get the bytes in the code above?
  2. Most importantly, where is this (the byte-preserving behavior of ISO-8859-1) specified? Links to the source or the JLS would be nice. Is it the only encoding with this property?
  3. Is it related to ISO-8859-1 being the default default?

See also this question for nice counter examples from other charsets.

Solution

"\u00F6" is not a byte array. It's a string containing a single char. Execute the following test instead:

import java.util.Arrays;

class RoundTripTest {
    public static void main(String[] args) throws Exception {
        byte[] b = new byte[] {(byte) 0x00, (byte) 0xf6};
        String s = new String(b, "ISO-8859-1"); // decoding: bytes -> chars
        byte[] b2 = s.getBytes("ISO-8859-1"); // encoding: chars -> bytes
        System.out.println("Are the bytes equal : " + Arrays.equals(b, b2)); // true
    }
}

To check that this is true for any byte, just extend the code and loop through all the bytes:

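import java.util.Arrays;

class AllBytesRoundTripTest {
    public static void main(String[] args) throws Exception {
        // Fill the array with every possible byte value, 0x00 through 0xFF.
        byte[] b = new byte[256];
        for (int i = 0; i < b.length; i++) {
            b[i] = (byte) i;
        }
        String s = new String(b, "ISO-8859-1"); // decoding
        byte[] b2 = s.getBytes("ISO-8859-1"); // encoding
        System.out.println("Are the bytes equal : " + Arrays.equals(b, b2));
    }
}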
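
For contrast, the same all-bytes round trip can be run against another charset. The following is only a sketch: the class name RoundTripComparison and its helper methods are made up for illustration, and UTF-8 is picked as just one example of a charset that does not preserve arbitrary bytes.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper class, not part of the original answer.
class RoundTripComparison {

    // Decodes the bytes with the given charset, encodes the result again,
    // and reports whether the original bytes came back unchanged.
    static boolean roundTrips(byte[] original, Charset charset) {
        String decoded = new String(original, charset);
        byte[] encoded = decoded.getBytes(charset);
        return Arrays.equals(original, encoded);
    }

    // Builds an array holding every possible byte value once (0x00..0xFF).
    static byte[] allByteValues() {
        byte[] all = new byte[256];
        for (int i = 0; i < all.length; i++) {
            all[i] = (byte) i;
        }
        return all;
    }

    public static void main(String[] args) {
        byte[] all = allByteValues();
        // prints true: every byte survives decoding and re-encoding
        System.out.println("ISO-8859-1 round trip: " + roundTrips(all, StandardCharsets.ISO_8859_1));
        // prints false: byte sequences that are not valid UTF-8 are replaced while decoding
        System.out.println("UTF-8 round trip: " + roundTrips(all, StandardCharsets.UTF_8));
    }
}
```

With UTF-8, bytes that do not form a valid sequence are decoded to the replacement character U+FFFD, so the original values cannot be recovered when the string is encoded again.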

ISO-8859-1 is a standard encoding, so the language used (Java, C#, or any other) doesn't matter.

Here's a Wikipedia reference that claims that every byte is covered:

In 1992, the IANA registered the character map ISO_8859-1:1987, more commonly known by its preferred MIME name of ISO-8859-1 (note the extra hyphen over ISO 8859-1), a superset of ISO 8859-1, for use on the Internet. This map assigns the C0 and C1 control characters to the unassigned code values thus provides for 256 characters via every possible 8-bit value.

(emphasis mine)
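
The "exact byte<->codepoint" claim can also be checked directly in Java. This is a minimal sketch with a hypothetical class name (Latin1CodePoints): it verifies that each byte decodes to the char whose code point equals the byte's unsigned value, and it shows why the question's output of [-10] is in fact the expected byte 0xF6.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of the original answer.
class Latin1CodePoints {

    // Returns true if decoding the single byte as ISO-8859-1 yields exactly
    // one char whose code point equals the unsigned value of that byte.
    static boolean mapsToSameCodePoint(int unsignedByte) {
        byte[] one = { (byte) unsignedByte };
        String decoded = new String(one, StandardCharsets.ISO_8859_1);
        return decoded.length() == 1 && decoded.charAt(0) == unsignedByte;
    }

    public static void main(String[] args) {
        boolean all = true;
        for (int i = 0; i < 256; i++) {
            all &= mapsToSameCodePoint(i);
        }
        System.out.println("Every byte maps to its own code point: " + all); // true

        byte[] encoded = "\u00F6".getBytes(StandardCharsets.ISO_8859_1);
        // Java bytes are signed, so 0xF6 prints as -10; masking with 0xFF
        // recovers the unsigned value.
        System.out.println(encoded[0] + " == 0x" + Integer.toHexString(encoded[0] & 0xFF)); // -10 == 0xf6
    }
}
```

So the code in the question does get the byte for ö: the value -10 is simply 0xF6 printed as a signed Java byte.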
