"Fix" String encoding in Java


Problem description

I have a String created from a byte[] array, using UTF-8 encoding.
However, it should have been created using another encoding (Windows-1252).

Is there a way to convert this String back to the right encoding?

I know it's easy to do if you have access to the original byte array, but in my case it's too late because the String is given by a closed-source library.

Solution

As there seems to be some confusion about whether this is possible or not, I think I need to provide an extensive example.

The question claims that the (initial) input is a byte[] that contains Windows-1252 encoded data. I'll call that byte[] ib (for "initial bytes").

For this example I'll choose the German word "Bär" (meaning bear) as the input:

byte[] ib = new byte[] { (byte) 0x42, (byte) 0xE4, (byte) 0x72 };
String correctString = new String(ib, "Windows-1252");
assert correctString.charAt(1) == '\u00E4'; //verify that the character was correctly decoded.

(If your JVM doesn't support that encoding, you can use ISO-8859-1 instead, because those three letters (and most others) are at the same positions in both encodings.)

The question goes on to state that some other code (that is outside of our influence) already converted that byte[] to a String using the UTF-8 encoding (I'll call that String is, for "input String"). That String is the only input that is available to achieve our goal (if ib were available, it would be trivial):

String is = new String(ib, "UTF-8");
System.out.println(is);

This obviously produces the incorrect output "B�".
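The setup so far can be reproduced with a small self-contained sketch, assuming a Java 7+ JVM that ships the windows-1252 charset. The class name MisdecodeDemo and the helper misdecode are illustrative names, not part of the original question or answer. The sketch checks for the replacement character U+FFFD rather than for an exact string, because how many characters the decoder emits for a malformed sequence may differ between JVM versions.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MisdecodeDemo {
    // "Bär" encoded as Windows-1252: B = 0x42, ä = 0xE4, r = 0x72
    static final byte[] IB = { (byte) 0x42, (byte) 0xE4, (byte) 0x72 };

    /** Decodes the bytes as UTF-8, i.e. with the wrong charset for this data. */
    static String misdecode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Correct decoding: the middle character is ä (U+00E4).
        String correct = new String(IB, Charset.forName("windows-1252"));
        System.out.println(correct.charAt(1) == '\u00E4'); // prints true

        // Wrong decoding: 0xE4 announces a three-byte UTF-8 sequence, but 0x72
        // is not a continuation byte, so the decoder substitutes U+FFFD.
        String is = misdecode(IB);
        System.out.println(is.indexOf('\uFFFD') >= 0);     // prints true
        System.out.println(is.indexOf('\u00E4') >= 0);     // prints false
    }
}
```

Using the Charset overloads of the String constructor also avoids the checked UnsupportedEncodingException that the String-name overloads declare.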

The goal would be to produce ib (or the correct decoding of that byte[]) with only is available.

Now some people claim that getting the UTF-8 encoded bytes from is will return an array with the same values as the initial array:

byte[] utf8Again = is.getBytes("UTF-8");

But that returns the UTF-8 encoding of the two characters B and � and definitely returns the wrong result when re-interpreted as Windows-1252:

System.out.println(new String(utf8Again, "Windows-1252"));

This line produces the output "Bï¿½", which is totally wrong (it is also the same output that would be the result if the initial array contained the non-word "Bür" instead).

So in this case you can't undo the operation, because information is lost.
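The information loss can be checked directly. The following is a sketch under the same assumptions as the example (Java 7+, windows-1252 available); RoundTripDemo and roundTrip are hypothetical names introduced only for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class RoundTripDemo {
    static final Charset CP1252 = Charset.forName("windows-1252");

    /** Mis-decodes the bytes as UTF-8, then re-encodes the result as UTF-8. */
    static byte[] roundTrip(byte[] original) {
        String is = new String(original, StandardCharsets.UTF_8);
        return is.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] baer = "B\u00E4r".getBytes(CP1252); // 42 E4 72
        byte[] buer = "B\u00FCr".getBytes(CP1252); // 42 FC 72

        // The round trip does not give the initial bytes back: the invalid
        // byte has become U+FFFD, which re-encodes as EF BF BD.
        System.out.println(Arrays.equals(baer, roundTrip(baer))); // prints false

        // If this prints true, two different inputs have collapsed into the
        // same String, so nothing can tell afterwards which one was meant.
        String fromBaer = new String(baer, StandardCharsets.UTF_8);
        String fromBuer = new String(buer, StandardCharsets.UTF_8);
        System.out.println(fromBaer.equals(fromBuer));
    }
}
```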

There are in fact cases where such mis-encodings can be undone. It's more likely to work when all possible (or at least all occurring) byte sequences are valid in the encoding that was wrongly used for decoding. Since UTF-8 has several byte sequences that are simply not valid, you will have problems.
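For contrast, here is a sketch of a case that can be undone. It assumes the opposite mistake from the one in the question: the data really was UTF-8, but it was decoded as ISO-8859-1. Every byte value is a valid ISO-8859-1 character, so no information is lost and the original bytes can be recovered. ReversibleDemo and repair are illustrative names.

```java
import java.nio.charset.StandardCharsets;

public class ReversibleDemo {
    /** Undoes a wrong ISO-8859-1 decoding of bytes that were really UTF-8. */
    static String repair(String misdecoded) {
        // ISO-8859-1 maps each byte to exactly one char, so this recovers
        // the original byte array unchanged.
        byte[] originalBytes = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(originalBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] utf8 = "B\u00E4r".getBytes(StandardCharsets.UTF_8);    // 42 C3 A4 72
        String wrong = new String(utf8, StandardCharsets.ISO_8859_1); // "BÃ¤r"
        System.out.println(repair(wrong).equals("B\u00E4r"));         // prints true
    }
}
```

This only helps when the wrongly applied decoder accepted every byte. It does not rescue the UTF-8 case from the question, where invalid bytes were already replaced.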
