Converting HTML character encoding in Java
Question
We are trying to download the source of web pages, but some characters (such as ü, ö, ş, ç) do not display properly because of character encoding. We tried the following code to convert the encoding of the string (the "text" variable):
byte[] xyz = text.getBytes();
text = new String(xyz,"windows-1254");
We observed that when a page's encoding is UTF-8, we still cannot see it correctly. What should we do?
Recommended answer
If you know the page encodes its content as UTF-8, tell the String constructor to interpret the bytes as UTF-8.
However, I am not sure this is the full extent of your problem. You already have "text" before you try to "convert" it. This means something has already interpreted the bytes of the page as a String, according to some encoding. If that was the wrong encoding, nothing you do afterwards is guaranteed to fix it.
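To illustrate why a later "conversion" may not help, here is a minimal sketch. The class name LossyDecodeDemo and the method roundTrips are made up for this example and are not part of the question's code. It encodes a string as UTF-8, decodes the bytes with a deliberately wrong charset, and checks whether re-encoding gives back the original bytes.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, not taken from the original question.
final class LossyDecodeDemo {

    /**
     * Returns true if decoding UTF-8 bytes with the wrong charset and then
     * re-encoding with that same charset gives back the original bytes.
     */
    static boolean roundTrips(String original, Charset wrongCharset) {
        byte[] pageBytes = original.getBytes(StandardCharsets.UTF_8);
        // The "upstream" mistake: bytes interpreted with the wrong charset.
        String misread = new String(pageBytes, wrongCharset);
        // The attempted repair: turn the String back into bytes.
        byte[] recovered = misread.getBytes(wrongCharset);
        return Arrays.equals(pageBytes, recovered);
    }

    public static void main(String[] args) {
        // "üöşç" written with Unicode escapes so the source file encoding does not matter.
        String sample = "\u00fc\u00f6\u015f\u00e7";
        System.out.println(roundTrips(sample, StandardCharsets.US_ASCII));
        System.out.println(roundTrips(sample, StandardCharsets.ISO_8859_1));
    }
}
```

With US-ASCII, the bytes that have no ASCII meaning are replaced during decoding, so the original bytes cannot be recovered. With ISO-8859-1, every byte maps to some character, so the round trip happens to work. That is why the damage is not always fixable: it depends on which wrong charset was used.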
Instead, you need to fix this upstream, at the point where the bytes of the page are first turned into a String:
byte[] bytesOfThePage = ...;
String text = new String(bytesOfThePage, "UTF-8");
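As a rough sketch of what fixing it upstream could look like: read the raw bytes of the response and decode them once, using the charset the server declares. The class PageDecoder and both method names are hypothetical. The sketch assumes the charset is given in a Content-Type value such as "text/html; charset=windows-1254", that the caller supplies a fallback, and that Java 9 or later is available for InputStream.readAllBytes. Pages that declare their encoding only in an HTML meta tag are not handled here.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.util.Locale;

// Hypothetical helper class, shown only as an illustration.
final class PageDecoder {

    /**
     * Extracts the charset from a Content-Type value such as
     * "text/html; charset=windows-1254". Returns the fallback when the
     * header is missing, has no charset parameter, or names a charset
     * this JVM does not support.
     */
    static Charset charsetFromContentType(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String param = part.trim();
            if (param.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = param.substring("charset=".length()).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Covers both illegal and unsupported charset names.
                    return fallback;
                }
            }
        }
        return fallback;
    }

    /** Reads all raw bytes from the stream and decodes them exactly once. */
    static String readPage(InputStream in, Charset charset) throws IOException {
        byte[] bytesOfThePage = in.readAllBytes(); // requires Java 9+
        return new String(bytesOfThePage, charset);
    }
}
```

With a URLConnection, for example, the header value would come from getContentType() and the stream from getInputStream(). The important part is that no String is created from the page until the charset is known.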