Detecting character encoding in HTML
Question
I download an HTML page. The HTTP Content-Type header specifies one character encoding, and the page has a meta tag that specifies another. What's the correct way to handle that?
I guess 'correct' isn't the right word, since nobody follows the damn standards anyway... so what's the way that will cause me the least problems?
Recommended answer
Do the same as web browsers do: use the response header. When HTML is served over HTTP and the Content-Type response header carries a charset parameter, the meta tag is ignored. The meta tag is used only when the header gives no charset, for example when the HTML is read from the local disk file system. This is also explicitly specified by the W3C HTML specification.
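According to the specification, conforming user agents must observe the following priorities when determining a document's character encoding (from highest priority to lowest):
- An HTTP "charset" parameter in a "Content-Type" field.
- A META declaration with "http-equiv" set to "Content-Type" and a value set for "charset".
- The charset attribute set on an element that designates an external resource.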
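As a rough illustration of that priority order, here is a minimal Java sketch that takes the charset from the Content-Type header first and only falls back to a meta declaration when the header gives none. The class and method names (`CharsetResolver`, `resolve`, `fromContentType`, `fromMeta`) are hypothetical, the meta scan is a simplified regex over the first 1024 bytes rather than a real HTML parser, and the UTF-8 default is an assumption, so treat it as a sketch and not as what browsers actually implement.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper illustrating "header first, meta second".
final class CharsetResolver {
    // Matches charset=... in a Content-Type value or inside a <meta> tag.
    private static final Pattern CHARSET_PARAM = Pattern.compile(
            "charset\\s*=\\s*[\"']?([^\\s;\"'/>]+)", Pattern.CASE_INSENSITIVE);
    private static final Pattern META_TAG = Pattern.compile(
            "<meta\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    // Extracts the charset parameter from a string; null if absent or unknown.
    static Charset fromContentType(String contentType) {
        if (contentType == null) {
            return null;
        }
        Matcher m = CHARSET_PARAM.matcher(contentType);
        if (!m.find()) {
            return null;
        }
        try {
            return Charset.forName(m.group(1));
        } catch (IllegalArgumentException e) {
            // Illegal or unsupported charset name: treat as "not specified".
            return null;
        }
    }

    // Simplified scan of the first bytes of the body for a meta declaration.
    static Charset fromMeta(byte[] body) {
        int len = Math.min(body.length, 1024); // assumed scan window
        // ISO-8859-1 maps every byte to a char, so ASCII markup survives decoding.
        String head = new String(body, 0, len, StandardCharsets.ISO_8859_1);
        Matcher tag = META_TAG.matcher(head);
        while (tag.find()) {
            Charset cs = fromContentType(tag.group());
            if (cs != null) {
                return cs;
            }
        }
        return null;
    }

    // Priority: HTTP header charset, then meta declaration, then a default.
    static Charset resolve(String contentTypeHeader, byte[] body) {
        Charset cs = fromContentType(contentTypeHeader);
        if (cs == null) {
            cs = fromMeta(body);
        }
        return cs != null ? cs : StandardCharsets.UTF_8; // assumed default
    }
}
```

With the raw response bytes in hand, something like `new String(body, CharsetResolver.resolve(header, body))` would then decode the page. In practice a proper HTML parser handles more cases than this sketch does, such as byte order marks.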
Any decent existing HTML parser, in whatever language you use, should already take this into account. According to your question history you're familiar with Java, so I'd suggest grabbing Jsoup for this.