"Fix" String encoding in Java

Views: 17 Published: 2021/12/27 15:49:30 java encoding


Question

I have a String that was created from a byte[] array using UTF-8 encoding.
However, it should have been created using another encoding (Windows-1252).

Is there a way to convert this String back to the right encoding?

I know it's easy to do if you have access to the original byte array, but in my case it's too late, because the String is handed to me by a closed-source library.

Recommended answer

As there seems to be some confusion about whether this is possible or not, I think I'll need to provide an extensive example.

The question claims that the (initial) input is a byte[] that contains Windows-1252 encoded data. I'll call that byte[] ib (for "initial bytes").

For this example I'll choose the German word "Bär" (meaning bear) as the input:

byte[] ib = new byte[] { (byte) 0x42, (byte) 0xE4, (byte) 0x72 };
String correctString = new String(ib, "Windows-1252");
assert correctString.charAt(1) == '\u00E4'; // verify that the character was correctly decoded

(If your JVM doesn't support that encoding, you can use ISO-8859-1 instead, because those three letters, and most others, are at the same positions in the two encodings.)
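A minimal sketch of that fallback, assuming the charset is looked up at run time. The class name CharsetFallback and the helper pickCharset are made up for illustration; Charset.isSupported, Charset.forName and StandardCharsets.ISO_8859_1 are standard java.nio.charset API.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetFallback {
    // Hypothetical helper: prefer Windows-1252, fall back to ISO-8859-1
    // if this JVM does not provide it.
    static Charset pickCharset() {
        return Charset.isSupported("windows-1252")
                ? Charset.forName("windows-1252")
                : StandardCharsets.ISO_8859_1;
    }

    public static void main(String[] args) {
        byte[] ib = { (byte) 0x42, (byte) 0xE4, (byte) 0x72 };
        String correctString = new String(ib, pickCharset());
        // 0xE4 is 'ä' (U+00E4) in both encodings, so this prints true.
        System.out.println(correctString.charAt(1) == '\u00E4');
    }
}
```

Passing a Charset object instead of a charset name also avoids the checked UnsupportedEncodingException that the String-name constructor declares.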

The question goes on to state that some other code (outside of our influence) has already converted that byte[] to a String using the UTF-8 encoding. I'll call that String is (for "input String"). It is the only input available to achieve our goal (if ib were available, the task would be trivial):

String is = new String(ib, "UTF-8");
System.out.println(is);

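This obviously produces the incorrect output "B�". (Newer JDKs may keep the trailing r and print "B�r"; either way the ä has been replaced by U+FFFD, the Unicode replacement character.)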

The goal would be to produce ib (or the correct decoding of that byte[]) with only the String is available.

Now some people claim that getting the UTF-8 encoded bytes from is will return an array with the same values as the initial array:

byte[] utf8Again = is.getBytes("UTF-8");

But that returns the UTF-8 encoding of the two characters B and � and definitely returns the wrong result when reinterpreted as Windows-1252:

System.out.println(new String(utf8Again, "Windows-1252"));

This line produces the output "Bï¿½", which is totally wrong (it is also the same output you would get if the initial array contained the non-word "Bür" instead).

So in this case you can't undo the operation, because some information was lost.
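The loss can be made concrete with a short sketch. It assumes a current JDK and uses made-up names (LossyDecode, misdecode) to stand in for whatever the closed-source library does. It decodes the Windows-1252 bytes of "Bär" and "Bür" as UTF-8 and checks that both end up as the same String, and that re-encoding does not give the original bytes back.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class LossyDecode {
    // Stand-in for the library call: decode arbitrary bytes as UTF-8.
    static String misdecode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] baer = { (byte) 0x42, (byte) 0xE4, (byte) 0x72 }; // "Bär" in Windows-1252
        byte[] buer = { (byte) 0x42, (byte) 0xFC, (byte) 0x72 }; // "Bür" in Windows-1252

        String a = misdecode(baer);
        String b = misdecode(buer);

        // Two different inputs collapse into the same String.
        System.out.println("same string: " + a.equals(b));
        // The invalid byte was replaced by U+FFFD.
        System.out.println("has U+FFFD: " + (a.indexOf('\uFFFD') >= 0));
        // Re-encoding as UTF-8 does not reproduce the initial bytes.
        System.out.println("round trip: "
                + Arrays.equals(a.getBytes(StandardCharsets.UTF_8), baer));
    }
}
```

Since both inputs map to one String, no function of that String alone can tell which byte array it came from.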

There are in fact cases where such mis-encodings can be undone. It's more likely to work when all possible (or at least all occurring) byte sequences are valid in the encoding that was wrongly applied. Since UTF-8 has several byte sequences that are simply not valid, you will have problems.
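As an illustration of a reversible case, here is a sketch of the opposite mistake: UTF-8 bytes that were wrongly decoded as ISO-8859-1. Every byte value is a valid ISO-8859-1 character, so nothing is lost and the original bytes can be recovered. The names ReversibleMisdecode and repair are invented for this example.

```java
import java.nio.charset.StandardCharsets;

public class ReversibleMisdecode {
    // Hypothetical helper: undo "UTF-8 bytes decoded as ISO-8859-1".
    static String repair(String garbled) {
        // ISO-8859-1 maps each char 0x00..0xFF back to the identical byte.
        byte[] original = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(original, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] utf8 = "B\u00E4r".getBytes(StandardCharsets.UTF_8);      // 42 C3 A4 72
        String garbled = new String(utf8, StandardCharsets.ISO_8859_1); // "BÃ¤r"
        System.out.println(repair(garbled).equals("B\u00E4r"));         // true
    }
}
```

This only works because the wrong decoding step was lossless; it does not help with the UTF-8 case from the question.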
