JSoup字符编码问题 [英] JSoup character encoding issue

查看:404
本文介绍了JSoup字符编码问题的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在使用JSoup解析 http://www.latijnengrieks.com/中的内容vertaling.php?id = 5368 .这是第三方网站,未指定正确的编码.我正在使用以下代码加载数据:

I am using JSoup to parse content from http://www.latijnengrieks.com/vertaling.php?id=5368 . this is a third party website and does not specify proper encoding. i am using the following code to load the data:

public class Loader {

    public static void main(String[] args){
        String url = "http://www.latijnengrieks.com/vertaling.php?id=5368";

        Document doc;
        try {

            doc = Jsoup.connect(url).timeout(5000).get();
            Element content = doc.select("div.kader").first();
            Element contenttableElement = content.getElementsByClass("kopje").first().parent().parent();

            String contenttext = content.html();
            String tabletext = contenttableElement.html();

            contenttext = Jsoup.parse(contenttext).text();
            contenttext = contenttext.replace("br2n", "\n");
            tabletext = Jsoup.parse(tabletext.replaceAll("(?i)<br[^>]*>", "br2n")).text();
            tabletext = tabletext.replace("br2n", "\n");

            String text = contenttext.substring(tabletext.length(), contenttext.length());
            System.out.println(text);


        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }


    }    

}

这将提供以下输出:

Aeneas dwaalt rond in Troje en zoekt Cre?sa. Cre?sa is echter op de vlucht gestorven Plotseling verschijnt er een schim. Het is de schim van Cre?sa. De schim zegt:'De oorlog woedt!' Troje is ingenomen! Cre?sa is gestorven:'Vlucht!' Aeneas vlucht echter niet. Dan spreekt de schim:'Vlucht! Er staat jou een nieuw vaderland en een nieuw koninkrijk te wachten.' Dan pas gehoorzaamt Aeneas en vlucht.

有什么办法吗?标记可以再次在输出中成为原始(ü)吗?

is there any way the ? marks can be the original (ü) again in the output?

推荐答案

HTTP响应Content-Type标头中缺少charset属性.解析HTML时,Jsoup将使用平台默认字符集. Document.OutputSettings#charset()将不起作用,因为它仅用于演示(在html()text()上),而不是用于解析数据(换句话说,为时已晚).

The charset attribute is missing in HTTP response Content-Type header. Jsoup will resort to platform default charset when parsing the HTML. The Document.OutputSettings#charset() won't work as it's used for presentation only (on html() and text()), not for parsing the data (in other words, it's too late already).

您需要将URL读取为InputStream,并在Jsoup#parse()方法中手动指定字符集.

You need to read the URL as InputStream and manually specify the charset in Jsoup#parse() method.

String url = "http://www.latijnengrieks.com/vertaling.php?id=5368";
Document document = Jsoup.parse(new URL(url).openStream(), "ISO-8859-1", url);
Element paragraph = document.select("div.kader p").first();

for (Node node : paragraph.childNodes()) {
    if (node instanceof TextNode) {
        System.out.println(((TextNode) node).text().trim());
    }
}

结果在这里

Aeneas dwaalt rond in Troje en zoekt Creüsa.
Creüsa is echter op de vlucht gestorven
Plotseling verschijnt er een schim.
Het is de schim van Creüsa.
De schim zegt:'De oorlog woedt!'
Troje is ingenomen!
Creüsa is gestorven:'Vlucht!'
Aeneas vlucht echter niet.
Dan spreekt de schim:'Vlucht! Er staat jou een nieuw vaderland en een nieuw koninkrijk te wachten.'
Dan pas gehoorzaamt Aeneas en vlucht.

这篇关于JSoup字符编码问题的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆