Converting HTML character encoding in Java


Question

We are trying to download the source of web pages, but some characters (such as ü, ö, ş, ç) do not display properly because of character encoding. We tried the following code to convert the encoding of the string (the "text" variable):

byte[] xyz = text.getBytes();
text = new String(xyz,"windows-1254"); 

We observed that if the page's encoding is UTF-8, we still cannot see pages correctly. What should we do?

Recommended answer

If you know the page encodes its contents as UTF-8, tell the String constructor to use the UTF-8 encoding to interpret the bytes.

However, I am not sure this is the full extent of your problem. You already have "text" before trying to "convert" it, which means something has already interpreted the bytes of the page as a String according to some encoding. If that was the wrong encoding, nothing you do later is guaranteed to fix it.
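To illustrate why a wrong first decode may be unrecoverable, here is a minimal sketch. The class name LossyDecodeDemo and the helper survivesRoundTrip are hypothetical names made up for this illustration, and US-ASCII and ISO-8859-1 merely stand in for "some wrong encoding"; they are not what the asker's code necessarily used.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class LossyDecodeDemo {
    /**
     * Decodes the bytes with a (wrong) charset, encodes the resulting String
     * back with the same charset, and reports whether the original bytes survive.
     */
    static boolean survivesRoundTrip(byte[] original, Charset wrong) {
        String misdecoded = new String(original, wrong);
        byte[] back = misdecoded.getBytes(wrong);
        return Arrays.equals(original, back);
    }

    public static void main(String[] args) {
        // "üöşç" written with Unicode escapes so the source file encoding does not matter.
        byte[] pageBytes = "\u00fc\u00f6\u015f\u00e7".getBytes(StandardCharsets.UTF_8);

        // US-ASCII cannot represent bytes >= 0x80, so they become U+FFFD and are lost.
        System.out.println(survivesRoundTrip(pageBytes, StandardCharsets.US_ASCII));   // false

        // ISO-8859-1 maps every byte to a character, so this particular mistake is reversible.
        System.out.println(survivesRoundTrip(pageBytes, StandardCharsets.ISO_8859_1)); // true
    }
}
```

Whether the damage is reversible depends on which wrong encoding was applied, which is why repairing the String after the fact is unreliable.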

Instead, you need to fix this upstream:

byte[] bytesOfThePage = ...;
String text = new String(bytesOfThePage, "UTF-8");
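One possible way to fix this upstream is to read the raw bytes yourself and decode them with the charset the server declares. The sketch below is only an illustration under assumptions: PageDownloader, charsetFromContentType, and download are hypothetical names, the UTF-8 fallback is an arbitrary choice, and a charset declared only in an HTML meta tag is not handled. It needs Java 9 or later for InputStream.readAllBytes.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.net.URLConnection;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class PageDownloader {
    /**
     * Extracts the charset parameter from a Content-Type value such as
     * "text/html; charset=UTF-8". Returns the fallback if none is usable.
     */
    static Charset charsetFromContentType(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                String name = p.substring(8).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    return fallback; // illegal or unsupported charset name
                }
            }
        }
        return fallback;
    }

    /** Downloads a page as raw bytes, then decodes them exactly once. */
    static String download(String pageUrl) throws IOException {
        URLConnection connection = URI.create(pageUrl).toURL().openConnection();
        byte[] bytesOfThePage;
        try (InputStream in = connection.getInputStream()) {
            bytesOfThePage = in.readAllBytes(); // keep bytes; no implicit decoding
        }
        // Assumption: fall back to UTF-8 when the server declares no charset.
        Charset charset = charsetFromContentType(
                connection.getContentType(), StandardCharsets.UTF_8);
        return new String(bytesOfThePage, charset);
    }
}
```

The key point is that the bytes are turned into a String only once, with an explicitly chosen charset, rather than being decoded with the platform default and "converted" afterwards.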
