java.nio.charset.UnsupportedCharsetException: X-MAC-ROMAN in Jsoup getting a webpage

Question

I have:

Document document = Jsoup.connect(link).get();

Sometimes, for some URLs, I get an exception:

Exception in thread "main" java.nio.charset.UnsupportedCharsetException: X-MAC-ROMAN
    at java.nio.charset.Charset.forName(Unknown Source)
    at org.jsoup.helper.DataUtil.parseByteData(DataUtil.java:86)
    at org.jsoup.helper.HttpConnection$Response.parse(HttpConnection.java:469)
    at org.jsoup.helper.HttpConnection.get(HttpConnection.java:147)

I have a catch block like this:

catch (IOException  e1)

I understand the exception is because Java is Unicode and that webpage/site is not following Unicode. How do I handle this issue? Also, the connect call is used for many websites, which include both Unicode and bytecode.

Recommended answer

"I understand the exception is because Java is Unicode and that webpage/site is not following Unicode."

That's not entirely correct. You're likely confusing the statement "Java is Unicode" with the fact that Java uses Unicode to store strings and characters in memory. Computer memory can only store bytes, not characters, so characters need to be converted to bytes and back using a specific character encoding; Java uses Unicode (UTF-16) for this internally.

This exception occurs because the Java runtime on the platform where your code runs does not support a charset with this name, so Java cannot decode the bytes obtained from the web server into characters using this encoding. This charset is specific to Mac OS, and you are likely running Windows or similar.
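To see what your own runtime supports, you can probe the charset name before relying on it. The following is a minimal sketch using only the JDK; the class name `CharsetProbe`, the method `isKnown`, and the made-up name `no-such-charset-xyz` are illustrative assumptions, not part of Jsoup or of the original code.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.UnsupportedCharsetException;

// Hypothetical helper: checks whether this Java runtime can resolve a charset name.
final class CharsetProbe {

    /** Returns true if the runtime knows a charset under the given name. */
    static boolean isKnown(String name) {
        try {
            return Charset.isSupported(name);
        } catch (IllegalCharsetNameException e) {
            // The name contains characters that are not legal in a charset name.
            return false;
        }
    }

    public static void main(String[] args) {
        // The name taken from the stack trace; the result depends on your runtime.
        String declared = "X-MAC-ROMAN";
        System.out.println(declared + " known to this runtime: " + isKnown(declared));

        try {
            // A deliberately made-up name, to trigger the same exception type.
            Charset.forName("no-such-charset-xyz");
        } catch (UnsupportedCharsetException e) {
            System.out.println("Unsupported charset: " + e.getCharsetName());
        }
    }
}
```

Whether `X-MAC-ROMAN` resolves depends on the Java runtime in use, so the first line of output will vary between machines.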

"How do I handle this issue?"

Contact the website admin and report it as a bug. It is their fault that they used a platform-specific (Mac OS) encoding instead of a universal (ISO/UTF) encoding.

As for Jsoup, your best bet is to first obtain the website as an InputStream via URL#openStream() and then feed it to Jsoup#parse(), explicitly specifying a character encoding that is supported on your platform, such as ISO-8859-1. For example:

Document doc = Jsoup.parse(new URL(link).openStream(), "ISO-8859-1", link);

Note that you still risk ending up with Mojibake when non-ASCII characters are present. Also note that you shouldn't do this for all links, but only for those that threw UnsupportedCharsetException (that is, perform the fallback in its catch block).
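One detail worth keeping in mind: UnsupportedCharsetException extends IllegalArgumentException and is unchecked, so a `catch (IOException e1)` block alone will not catch it. The fallback-only-on-failure pattern can be sketched with the JDK alone, as below; `FallbackDecoder` and `decode` are hypothetical names for illustration. With Jsoup, the catch branch would instead run the `Jsoup.parse(new URL(link).openStream(), "ISO-8859-1", link)` call shown above.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;
import java.nio.charset.UnsupportedCharsetException;

// Hypothetical helper: decodes bytes with the declared charset and falls back
// to ISO-8859-1 only when the runtime cannot resolve that charset.
final class FallbackDecoder {

    static String decode(byte[] body, String declaredCharset) {
        try {
            // Normal path: honor the charset the server declared.
            return new String(body, Charset.forName(declaredCharset));
        } catch (UnsupportedCharsetException | IllegalCharsetNameException e) {
            // Fallback path: reached only for charsets this runtime does not know.
            // Non-ASCII characters may come out as Mojibake here.
            return new String(body, StandardCharsets.ISO_8859_1);
        }
    }

    public static void main(String[] args) {
        byte[] body = "plain ASCII body".getBytes(StandardCharsets.US_ASCII);
        // A made-up charset name forces the fallback branch.
        System.out.println(decode(body, "no-such-charset-xyz"));
    }
}
```

Keeping the fallback inside the catch block means pages with a supported charset are still decoded exactly as declared.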

"But I can access it in Chrome, so why can't I access it from Jsoup?"

That is because Chrome tries to be kind to you: it ignores the unknown encoding and chooses a default encoding instead. This still risks the website being displayed as Mojibake; anything beyond the ASCII range might look malformed.
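To illustrate why only characters beyond the ASCII range are affected, here is a small sketch that decodes text with the wrong charset. It assumes, purely for the demonstration, that the bytes are UTF-8 and the guessed default is ISO-8859-1; `MojibakeDemo` and `misdecode` are hypothetical names.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo: encodes text as UTF-8, then decodes it with the wrong charset.
final class MojibakeDemo {

    static String misdecode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        // Deliberately wrong charset: every byte becomes one ISO-8859-1 character.
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Pure ASCII survives, because both encodings agree on that range.
        System.out.println(misdecode("cafe"));
        // U+00E9 is two bytes in UTF-8 (0xC3 0xA9), so it turns into two characters.
        System.out.println(misdecode("caf\u00e9"));
    }
}
```

The ASCII-only string comes back unchanged, while the accented character is split into the two characters U+00C3 and U+00A9.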

"The connect is used for many websites which include both Unicode and bytecode."

Please refresh your vocabulary on the meaning of the word "bytecode". Bytecode is the compiled form of Java code that the JVM executes; it has absolutely nothing to do with character encodings.
