Reading File from Windows and Linux yields different results (character encoding?)


Question

Currently I'm trying to read a file in a mime format which has some binary string data of a png.

In Windows, reading the file gives me the proper binary string, meaning I can just copy the string into a new file, change the extension to png, and see the picture.

An example after reading the file in Windows is below:

    --fh-mms-multipart-next-part-1308191573195-0-53229
     Content-Type: image/png;name=app_icon.png
     Content-ID: "<app_icon>"
     content-location: app_icon.png

    ‰PNG

etc... etc...

An example after reading the file in Linux is below:

    --fh-mms-multipart-next-part-1308191573195-0-53229
     Content-Type: image/png;name=app_icon.png
     Content-ID: "<app_icon>"
     content-location: app_icon.png

     �PNG

etc... etc...

I am not able to convert the Linux version into a picture; it all becomes funky symbols, with a lot of upside-down "?" and "1/2" symbols.

Can anyone enlighten me on what is going on and maybe provide a solution? I have been playing with this code for over a week now.
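(The question does not show the reading code, so the following is a hypothetical reproduction of the symptom. `FileReader` silently decodes with the platform default charset, which is typically windows-1252 on Windows and UTF-8 on Linux, so the same program interprets the PNG bytes differently on each OS. The file name `message.mime` is a placeholder.)

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

// Hypothetical pattern that reproduces the symptom: FileReader uses the
// platform default charset, so the PNG payload is decoded as windows-1252
// on a typical Windows JVM and as UTF-8 on a typical Linux JVM.
public class DefaultCharsetRead {
    public static void main(String[] args) throws IOException {
        StringBuilder sb = new StringBuilder();
        // "message.mime" is a placeholder name for the multipart file.
        try (BufferedReader reader = new BufferedReader(new FileReader("message.mime"))) {
            int c;
            while ((c = reader.read()) != -1) {
                sb.append((char) c);
            }
        }
        // On Linux, the leading 0x89 of the PNG payload has already been
        // replaced by U+FFFD at this point; the original byte is lost.
        System.out.println(sb);
    }
}
```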

Answer

� is a sequence of three bytes - 0xEF 0xBF 0xBD - and is the UTF-8 representation of the Unicode codepoint 0xFFFD. That codepoint is itself the replacement character, substituted for illegal byte sequences during UTF-8 decoding.

Apparently, for some reason, the set of routines involved in your source code (on Linux) is handling the PNG header inaccurately. The PNG header starts with the byte 0x89 (followed by 0x50, 0x4E, 0x47), which is handled correctly on Windows (which is probably treating the file as a sequence of CP1252 bytes). In CP1252, the 0x89 byte is displayed as ‰.

On Linux, however, this byte is being decoded by a UTF-8 routine (or a library that thought it was fine to process the file as a UTF-8 sequence). Since 0x89 on its own is not a valid codepoint in the ASCII-7 range (ref: the UTF-8 encoding scheme), it cannot be mapped to a valid single-byte UTF-8 codepoint in the 0x00-0x7F range. It also cannot be mapped to a valid codepoint represented as a multi-byte UTF-8 sequence, for all multi-byte sequences start with at least 2 bits set to 1 (11....), and since this is the start of the file, it cannot be a continuation byte either. The resulting behavior is that the UTF-8 decoder replaces 0x89 with the UTF-8 replacement character 0xEF 0xBF 0xBD (how silly, considering that the file is not UTF-8 to begin with), which is then displayed in ISO-8859-1 as �.
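This behavior is easy to reproduce directly. A minimal sketch, decoding the same four PNG-signature bytes as windows-1252 versus UTF-8:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// The PNG signature bytes decoded with two different charsets.
public class ReplacementCharDemo {
    public static void main(String[] args) {
        byte[] pngStart = { (byte) 0x89, 0x50, 0x4E, 0x47 }; // 0x89 'P' 'N' 'G'

        // windows-1252 maps 0x89 to the per-mille sign, so the text survives.
        String asCp1252 = new String(pngStart, Charset.forName("windows-1252"));
        System.out.println(asCp1252);                     // ‰PNG

        // UTF-8 rejects 0x89 as a lead byte and substitutes U+FFFD.
        String asUtf8 = new String(pngStart, StandardCharsets.UTF_8);
        System.out.println((int) asUtf8.charAt(0));       // 65533, i.e. 0xFFFD

        // Re-encoding the damaged string yields the 3-byte sequence EF BF BD.
        byte[] reEncoded = asUtf8.getBytes(StandardCharsets.UTF_8);
        System.out.printf("%02X %02X %02X%n",
                reEncoded[0], reEncoded[1], reEncoded[2]); // EF BF BD
    }
}
```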

If you need to resolve this problem, you'll need to ensure the following in Linux:


  • Read the bytes in the PNG file, using the suitable encoding for the file (i.e. not UTF-8); this is apparently necessary if you are reading the file as a sequence of characters*, and not necessary if you are reading bytes alone. You might be doing this correctly, so it would be worthwhile to verify the subsequent step(s) also.
  • When you are viewing the contents of the file, use a suitable editor/viewer that does not perform any internal decoding of the file to a sequence of UTF-8 bytes. Using a suitable font will also help, for you might want to prevent the scenario where the glyph (for 0xFFFD it is actually the diamond character �) cannot be represented (unlikely, but you never know how the editor/viewer has been written).
  • It is also a good idea to write the files out (if you are doing so) in the suitable encoding - ISO-8859-1 perhaps, instead of UTF-8. If you are processing and storing the file contents in memory as bytes instead of characters, then writing these to an output stream (without the involvement of any String or character references) is sufficient.
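The byte-only approach from the first and last points above can be sketched as follows. The payload never touches a String, so no charset decoding can corrupt it; the file names are placeholders:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Copy the binary payload as raw bytes; no String is involved, so the
// platform default charset is never consulted. File names are hypothetical.
public class BinarySafeCopy {
    public static void main(String[] args) throws IOException {
        Path in  = Paths.get("message.mime"); // placeholder input
        Path out = Paths.get("app_icon.png"); // placeholder output

        // readAllBytes/write operate on byte[] directly.
        byte[] raw = Files.readAllBytes(in);
        Files.write(out, raw);
    }
}
```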

* Apparently, the Java runtime will decode the byte sequence into UTF-16 codepoints if you convert a sequence of bytes into a char or a String object.
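If the data must pass through a String at all, ISO-8859-1 is the one safe choice, because it maps every byte value 0x00-0xFF to exactly one char, making the byte-to-String-to-byte round trip lossless; UTF-8 is not. A minimal sketch:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Round-trip bytes through a String with two charsets: ISO-8859-1 is
// lossless for arbitrary bytes, UTF-8 destroys the 0x89 header byte.
public class RoundTripDemo {
    static byte[] roundTrip(byte[] data, Charset cs) {
        return new String(data, cs).getBytes(cs);
    }

    public static void main(String[] args) {
        byte[] header = { (byte) 0x89, 0x50, 0x4E, 0x47 };
        System.out.println(Arrays.equals(header,
                roundTrip(header, StandardCharsets.ISO_8859_1))); // true
        System.out.println(Arrays.equals(header,
                roundTrip(header, StandardCharsets.UTF_8)));      // false
    }
}
```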
