Reading File from Windows and Linux yields different results (character encoding?)


Question

Currently I'm trying to read a file in MIME format which contains the binary data of a PNG as a string.

In Windows, reading the file gives me the proper binary string, meaning I just copy the string over and change the extension to png and I see the picture.

An example after reading the file in Windows is below:

    --fh-mms-multipart-next-part-1308191573195-0-53229
     Content-Type: image/png;name=app_icon.png
     Content-ID: "<app_icon>"
     content-location: app_icon.png

    ‰PNG

etc... etc...

An example after reading the file in Linux is below:

    --fh-mms-multipart-next-part-1308191573195-0-53229
     Content-Type: image/png;name=app_icon.png
     Content-ID: "<app_icon>"
     content-location: app_icon.png

     �PNG

etc... etc...

I am not able to convert the Linux version into a picture as it all becomes some funky symbols with a lot of upside down "?" and "1/2" symbols.

Can anyone enlighten me on what is going on and maybe provide a solution? Been playing with the code for a week and more now.

Recommended answer

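The � in your Linux output is the Unicode code point U+FFFD, the replacement character that decoders substitute for illegal UTF-8 sequences. Its UTF-8 representation is the three-byte sequence 0xEF 0xBF 0xBD, which shows up as the three characters ï¿½ when those bytes are displayed as ISO-8859-1.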

Apparently, for some reason, the set of routines involved in your source code (on Linux) is handling the PNG header inaccurately. The PNG header starts with the byte 0x89 (followed by 0x50, 0x4E, 0x47), which is handled correctly on Windows (which might be treating the file as a sequence of CP1252 bytes). In CP1252, the byte 0x89 is displayed as ‰.

On Linux, however, this byte is being decoded by a UTF-8 routine (or by a library that thought it was a good idea to process the file as a UTF-8 sequence). Since 0x89 on its own is outside the 7-bit ASCII range (ref: the UTF-8 encoding scheme), it cannot be a single-byte UTF-8 sequence, which must lie in the 0x00-0x7F range. It cannot start a multi-byte UTF-8 sequence either, because every multi-byte sequence starts with a byte whose top two bits are set to 1 (11......), whereas 0x89 is 10001001 in binary. That bit pattern is the form of a continuation byte, but since it is the first byte of the PNG data and no lead byte precedes it, it cannot be a valid continuation byte. The result is that the UTF-8 decoder replaces 0x89 with the replacement character U+FFFD, encoded in UTF-8 as 0xEF 0xBF 0xBD (how silly, considering that the file is not UTF-8 to begin with), which will be displayed in ISO-8859-1 as ï¿½.
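The substitution described above can be reproduced in a few lines of Java. This is only a minimal sketch: the class name PngHeaderDecodeDemo and the helpers hex and roundTrip are made up for illustration, and it assumes the windows-1252 charset is available in your JDK.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class PngHeaderDecodeDemo {
    // First four bytes of a PNG file: 0x89 followed by "PNG" in ASCII.
    static final byte[] PNG_START = {(byte) 0x89, 0x50, 0x4E, 0x47};

    // Formats bytes as space-separated upper-case hex, e.g. "89 50 4E 47".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Decodes the bytes to a String with the given charset, then encodes
    // that String back to bytes with the same charset.
    static byte[] roundTrip(byte[] bytes, Charset charset) {
        return new String(bytes, charset).getBytes(charset);
    }

    public static void main(String[] args) {
        Charset cp1252 = Charset.forName("windows-1252");

        // As UTF-8 the lone 0x89 is malformed, so it is decoded to U+FFFD
        // and comes back out as the three bytes EF BF BD.
        byte[] viaUtf8 = roundTrip(PNG_START, StandardCharsets.UTF_8);
        System.out.println("UTF-8 round trip:  " + hex(viaUtf8));

        // As windows-1252 the byte 0x89 maps to U+2030 and back to 0x89.
        byte[] viaCp1252 = roundTrip(PNG_START, cp1252);
        System.out.println("CP1252 round trip: " + hex(viaCp1252));

        // Viewing the damaged bytes as ISO-8859-1 gives the "funky symbols".
        System.out.println(new String(viaUtf8, StandardCharsets.ISO_8859_1));
    }
}
```

The UTF-8 round trip yields EF BF BD 50 4E 47, while the CP1252 round trip returns the original 89 50 4E 47. Once the replacement has happened the original byte is gone, so the damage cannot be undone after the fact.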

If you need to resolve this problem, you'll need to ensure the following in Linux:

  • Read the bytes in the PNG file using a suitable encoding for the file (i.e. not UTF-8). This is necessary if you are reading the file as a sequence of characters*, and unnecessary if you are reading raw bytes alone. You might already be doing this correctly, so it is worth verifying the subsequent steps as well.
  • When viewing the contents of the file, use an editor/viewer that does not internally decode the file as a sequence of UTF-8 bytes. Using a suitable font also helps: you want to avoid the scenario where the glyph cannot be rendered (for U+FFFD it is actually the diamond character �), which might result in further changes (unlikely, but you never know how the editor/viewer has been written).
  • It is also a good idea to write the files out (if you are doing so) in a suitable encoding, ISO-8859-1 perhaps, instead of UTF-8. If you process and store the file contents in memory as bytes instead of characters, then writing these to an output stream (without involving any String or character references) is sufficient.
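The first and last points can be sketched in Java as follows. The class and method names (BinarySafeCopyDemo, copyAsBytes, latin1RoundTrip) are hypothetical, and the demo uses temporary files rather than your actual MIME file. The first method never leaves the byte domain. The second shows that, if a String is unavoidable, ISO-8859-1 maps every byte value 0x00-0xFF to exactly one character and back.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

class BinarySafeCopyDemo {
    // Option 1: read and write raw bytes, with no String or char involved.
    static void copyAsBytes(Path in, Path out) throws IOException {
        byte[] data = Files.readAllBytes(in);
        Files.write(out, data);
    }

    // Option 2: go through a String, but with ISO-8859-1, which decodes
    // each byte to one char and encodes it back to the same byte.
    static byte[] latin1RoundTrip(byte[] data) {
        String text = new String(data, StandardCharsets.ISO_8859_1);
        return text.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) throws IOException {
        // Test data containing every possible byte value once.
        byte[] all = new byte[256];
        for (int i = 0; i < all.length; i++) {
            all[i] = (byte) i;
        }

        Path in = Files.createTempFile("sample", ".bin");
        Path out = Files.createTempFile("copy", ".bin");
        Files.write(in, all);

        copyAsBytes(in, out);
        System.out.println("byte copy intact:   "
                + Arrays.equals(all, Files.readAllBytes(out)));
        System.out.println("ISO-8859-1 intact:  "
                + Arrays.equals(all, latin1RoundTrip(all)));

        Files.delete(in);
        Files.delete(out);
    }
}
```

Staying with bytes is usually the simpler choice for binary data such as an embedded PNG. The ISO-8859-1 route only helps when some API forces the data through a String.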

* Apparently, the Java runtime decodes the byte sequence into UTF-16 code units if you convert a sequence of bytes to a character or a String object.
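One plausible reason for the Windows/Linux difference is that new String(bytes) without an explicit charset uses the platform default. Before Java 18 that default depended on the operating system and locale; from Java 18 onward it is UTF-8 everywhere. The sketch below (DefaultCharsetDemo and preserves are illustrative names) prints the default and checks which charsets keep the byte 0x89 intact. Whether this is what actually happens in your code is an assumption, since the source was not posted.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class DefaultCharsetDemo {
    // Returns true if decoding the bytes to a String and encoding them
    // back with the same charset reproduces the original bytes.
    static boolean preserves(byte[] bytes, Charset charset) {
        byte[] back = new String(bytes, charset).getBytes(charset);
        return Arrays.equals(bytes, back);
    }

    public static void main(String[] args) {
        byte[] pngStart = {(byte) 0x89, 0x50, 0x4E, 0x47};

        // The charset used by new String(bytes) and similar calls
        // when no charset argument is given.
        Charset platform = Charset.defaultCharset();
        System.out.println("default charset: " + platform);
        System.out.println("default keeps 0x89:    "
                + preserves(pngStart, platform));

        // Naming the charset explicitly removes the platform dependence.
        System.out.println("ISO-8859-1 keeps 0x89: "
                + preserves(pngStart, StandardCharsets.ISO_8859_1));
        System.out.println("UTF-8 keeps 0x89:      "
                + preserves(pngStart, StandardCharsets.UTF_8));
    }
}
```

Passing the charset explicitly wherever bytes become characters makes the behavior identical on both systems.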
