Strange encoding behaviour with jsoup

Question

I extract some information from the HTML source code of different pages with jsoup. Most of them are UTF-8 encoded. One of them is encoded in ISO-8859-1, which leads to a strange error (in my opinion).

The page that contains the error is:
http://www.gudi.ch/armbanduhr-metall-wasserdicht-1280x960-megapixels-p-560.html

I read the needed string with the following piece of code:

Document doc = Jsoup.connect("http://www.gudi.ch/armbanduhr-metall-wasserdicht-1280x960-megapixels-p-560.html").userAgent("Mozilla").get();
String title = doc.getElementsByClass("products_name").first().text();

The problem is the hyphen in the string "HD Armbanduhr aus Metall 4GB Wasserdicht 1280X960 – 5 Megapixels". Normal umlauts like öäü are read correctly. Only this single character, which is not output as "&#45;", causes the problem.

I tried to override the (correctly set) page encoding with out.outputSettings().charset("ISO-8859-1"), but that didn't help either.

Next, I tried to change the encoding of the string manually with the Charset class, from and to UTF-8 and ISO-8859-1. Also no luck.

Does anyone have a tip on what I can try to get the correct character after parsing the HTML document with jsoup?

Thanks

Recommended answer

This is a mistake of the website itself. Actually, there are three mistakes:


  1. The page is served without any charset in the HTTP Content-Type response header. There is ISO-8859-1 in the HTML meta tag, but this is ignored when the page is served over HTTP! The average web browser will either try smart detection or use the platform default encoding to decode the webpage, which is CP1252 on Windows machines.

  2. The <meta> tag pretends that the content is ISO-8859-1 encoded, but the actual character (U+2013 EN DASH) is not covered by that charset at all. It is, however, covered by the CP1252 charset as 0x96.

  3. According to the webpage source code, the product name uses the literal character instead of the HTML entity &ndash; as spotted elsewhere on the same webpage.
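The charset mismatch described above can be reproduced without any network access. The following is a minimal sketch using only the Java standard library. The class and method names (DashDecodingDemo, decodeByte) are made up for illustration and are not part of jsoup. It decodes the single byte 0x96 with both charsets and checks whether ISO-8859-1 can encode the en dash at all.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DashDecodingDemo {

    // Decode one byte with the given charset and return the resulting code point.
    static int decodeByte(int b, Charset charset) {
        String decoded = new String(new byte[] {(byte) b}, charset);
        return decoded.codePointAt(0);
    }

    public static void main(String[] args) {
        // "windows-1252" is the canonical Java name for CP1252.
        Charset cp1252 = Charset.forName("windows-1252");

        // Under CP1252 the byte 0x96 is the en dash.
        System.out.printf("CP1252:     U+%04X%n", decodeByte(0x96, cp1252));

        // Under ISO-8859-1 the same byte is a C1 control character.
        System.out.printf("ISO-8859-1: U+%04X%n",
                decodeByte(0x96, StandardCharsets.ISO_8859_1));

        // ISO-8859-1 has no mapping for U+2013 EN DASH.
        boolean encodable = StandardCharsets.ISO_8859_1.newEncoder().canEncode('\u2013');
        System.out.println("ISO-8859-1 can encode U+2013: " + encodable);
    }
}
```

Running it should print U+2013 for CP1252, U+0096 for ISO-8859-1, and false for the last check, which matches the second point in the list.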

Jsoup can fix many badly developed webpages transparently, but this one really goes beyond Jsoup. You need to read it in manually and then feed it to Jsoup as CP1252.

String url = "http://www.gudi.ch/armbanduhr-metall-wasserdicht-1280x960-megapixels-p-560.html";
InputStream input = new URL(url).openStream();
Document doc = Jsoup.parse(input, "CP1252", url);
String title = doc.select(".products_name").first().text();
// ...
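To see the effect of the explicit charset in isolation, here is a minimal sketch that uses only the Java standard library, with an in-memory stream standing in for the HTTP response. The class and method names (ReadAsCp1252, readAll) and the sample bytes are assumptions made for illustration, and readAllBytes requires Java 9 or later.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

public class ReadAsCp1252 {

    // Read the whole stream and decode its bytes with the named charset.
    static String readAll(InputStream input, String charsetName) {
        try {
            return new String(input.readAllBytes(), Charset.forName(charsetName));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical fragment of the product name as raw bytes: "960 ", 0x96, " 5".
        byte[] raw = {'9', '6', '0', ' ', (byte) 0x96, ' ', '5'};

        // Decoding as windows-1252 (CP1252) yields the en dash.
        String asCp1252 = readAll(new ByteArrayInputStream(raw), "windows-1252");
        // Decoding as ISO-8859-1 yields the control character U+0096 instead.
        String asLatin1 = readAll(new ByteArrayInputStream(raw), "ISO-8859-1");

        System.out.printf("windows-1252: U+%04X%n", asCp1252.codePointAt(4));
        System.out.printf("ISO-8859-1:   U+%04X%n", asLatin1.codePointAt(4));
    }
}
```

A string decoded this way could also be handed to Jsoup.parse(html, baseUri) instead of the stream variant. Which of the two is more convenient depends on how the page is fetched.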
