Converting UTF-8 to ISO-8859-1 in Java

Question

I am reading an XML document (UTF-8) and ultimately displaying the content on a Web page using ISO-8859-1. As expected, a few characters are not displayed correctly, such as typographic ("smart") quotation marks (they display as ?).

Is it possible to convert these characters from UTF-8 to ISO-8859-1?

Here is a snippet of code I have written to attempt this:

BufferedReader br = new BufferedReader(new InputStreamReader(urlConnection.getInputStream(), "UTF-8"));
StringBuilder sb = new StringBuilder();

String line = null;
while ((line = br.readLine()) != null) {
  sb.append(line);
}
br.close();

byte[] latin1 = sb.toString().getBytes("ISO-8859-1");

return new String(latin1);

I'm not quite sure what's going awry, but I believe it's readLine() that's causing the grief (since the strings would be Java/UTF-16 encoded?). Another variation I tried was to replace latin1 with

byte[] latin1 = new String(sb.toString().getBytes("UTF-8")).getBytes("ISO-8859-1");

I have read previous posts on the subject and I'm learning as I go. Thanks in advance for your help.

Recommended answer

I'm not sure if there is a normalization routine in the standard library that will do this. I do not think conversion of "smart" quotes is handled by the standard Unicode normalizer routines - but don't quote me.
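
One way to check this is a small experiment. The sketch below is not from the original post: the class and method names are hypothetical, and it only uses java.text.Normalizer and java.nio.charset from the standard library. It tests whether U+201C fits in ISO-8859-1, whether NFKC normalization changes it, and what an encode/decode round trip through ISO-8859-1 produces.

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

// Hypothetical helper class, for illustration only.
final class SmartQuoteCheck {
  private SmartQuoteCheck() {}

  // True if every character of the text can be encoded as ISO-8859-1.
  static boolean fitsLatin1(String text) {
    return StandardCharsets.ISO_8859_1.newEncoder().canEncode(text);
  }

  // Applies compatibility normalization (NFKC) from the standard library.
  static String normalize(String text) {
    return Normalizer.normalize(text, Normalizer.Form.NFKC);
  }

  // Encodes to ISO-8859-1 and decodes again with the same charset.
  // String.getBytes substitutes the charset's replacement byte for
  // characters it cannot map.
  static String roundTrip(String text) {
    byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
    return new String(latin1, StandardCharsets.ISO_8859_1);
  }

  public static void main(String[] args) {
    String quote = "\u201C"; // LEFT DOUBLE QUOTATION MARK
    System.out.println("fits in ISO-8859-1: " + fitsLatin1(quote));
    System.out.println("unchanged by NFKC: " + normalize(quote).equals(quote));
    System.out.println("after round trip: " + roundTrip(quote));
  }
}
```

If the round trip yields a question mark, that would account for the ? characters described in the question: the information is lost at the getBytes step, before the page is ever rendered.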

The smart thing to do is to dump ISO-8859-1 and start using UTF-8. That said, it is possible to encode any normally allowed Unicode code point into an HTML page encoded as ISO-8859-1. You can encode them using escape sequences as shown here:

public final class HtmlEncoder {
  private HtmlEncoder() {}

  public static <T extends Appendable> T escapeNonLatin(CharSequence sequence,
      T out) throws java.io.IOException {
    for (int i = 0; i < sequence.length(); i++) {
      char ch = sequence.charAt(i);
      if (Character.UnicodeBlock.of(ch) == Character.UnicodeBlock.BASIC_LATIN) {
        out.append(ch);
      } else {
        int codepoint = Character.codePointAt(sequence, i);
        // handle supplementary range chars
        i += Character.charCount(codepoint) - 1;
        // emit entity
        out.append("&#x");
        out.append(Integer.toHexString(codepoint));
        out.append(";");
      }
    }
    return out;
  }
}

Example usage:

String foo = "This is Cyrillic Ya: \u044F\n"
    + "This is fraktur G: \uD835\uDD0A\n" + "This is a smart quote: \u201C";

StringBuilder sb = HtmlEncoder.escapeNonLatin(foo, new StringBuilder());
System.out.println(sb.toString());

Above, the character LEFT DOUBLE QUOTATION MARK ( U+201C ) is encoded as &#x201c; (Integer.toHexString emits lowercase hex digits, which HTML treats the same as &#x201C;). A couple of other arbitrary code points are likewise encoded.

Care needs to be taken with this approach. If your text needs to be escaped for HTML, that must be done before running the above code; otherwise the ampersands in the generated character references end up being escaped themselves.
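
The ordering issue can be seen in a minimal sketch. It is self-contained rather than built on HtmlEncoder, and everything in it is an assumption made for illustration: the class name, the method names, and the deliberately small set of markup characters escaped in the first step. A real application would likely use a proper HTML escaping library for that step.

```java
// Hypothetical helper class, for illustration only.
final class EscapeOrder {
  private EscapeOrder() {}

  // Step 1: escape HTML-significant characters (a minimal, assumed set).
  static String escapeMarkup(String text) {
    return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
  }

  // Step 2: replace every code point above U+007F with a hex character reference.
  static String escapeNonLatin(String text) {
    StringBuilder out = new StringBuilder();
    text.codePoints().forEach(cp -> {
      if (cp < 0x80) {
        out.appendCodePoint(cp);
      } else {
        out.append("&#x").append(Integer.toHexString(cp)).append(';');
      }
    });
    return out.toString();
  }

  public static void main(String[] args) {
    String text = "Tom & Jerry \u201Cquoted\u201D";
    // Markup first, then character references: references stay intact.
    System.out.println(escapeNonLatin(escapeMarkup(text)));
    // Reversed order: the ampersand of each reference is escaped again.
    System.out.println(escapeMarkup(escapeNonLatin(text)));
  }
}
```

In the reversed order, each character reference arrives at the browser as literal text such as &amp;#x201c; instead of being rendered as a quotation mark.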
