Prefer charset declaration in HTML meta tag or HTTP header?


Question


I'm parsing a lot of sites. Everything works fine, and I also read the charset declarations to convert encodings. Now I have a problem with http://celleheute.de/sonntagsfuhrung-3/.

The HTML meta tag says the content is encoded as ISO-8859-2, but the HTTP header says it's UTF-8. The content really is UTF-8 encoded, so when my parser converts it as if it were ISO-8859-2, some characters break.
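The breakage described above can be reproduced in a few lines. This is only a minimal sketch: the sample word "Führung" is an illustrative assumption and is not taken from the page in question.

```python
# Illustrative sample text (an assumption, not copied from the actual page).
text = "Führung"

# The server really sends UTF-8 bytes ...
raw = text.encode("utf-8")

# ... but a parser that trusts the meta tag decodes them as ISO-8859-2.
garbled = raw.decode("iso-8859-2")

# Decoding with the encoding from the HTTP header gives the text back intact.
correct = raw.decode("utf-8")

print(repr(garbled))
print(repr(correct))
```

The two-byte UTF-8 sequence for "ü" is read as two separate ISO-8859-2 characters, which is why the non-ASCII characters are the ones that break.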

Now my question is: which declaration should I prefer? Should I ignore the meta tag when I can find a declaration in the HTTP header, or vice versa? What do most web browsers do?

Answer

To understand what modern browsers do, start reading at http://dev.w3.org/html5/spec/parsing.html#determining-the-character-encoding

Steps one and two are most relevant to the question. They say:

  1. If the user has explicitly instructed the user agent to override the document's character encoding with a specific encoding, optionally return that encoding with the confidence certain and abort these steps.

  2. If the transport layer specifies an encoding, and it is supported, return that encoding with the confidence certain, and abort these steps.

This means that the real HTTP header takes precedence over everything except a user override.

Beyond that, it can get complex. A byte order mark can, for example, take precedence over the meta tag.


UPDATE: Since this answer was written, the spec changed (around mid-2012) so that the byte order mark now takes precedence over the HTTP header.
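For a parser, the precedence described above (byte order mark, then HTTP header, then meta tag) could be sketched roughly as follows. This is not the spec's algorithm: the function names `pick_encoding` and `charset_from_header` are hypothetical, the regex is a crude stand-in for the spec's meta prescan, the 1024-byte scan window and the UTF-8 fallback are assumptions, and the user-override step is left out.

```python
import codecs
import re
from email.message import Message

# Byte order marks checked first (post-2012 precedence).
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

# Crude stand-in for the spec's meta prescan; real pages can defeat this regex.
_META_RE = re.compile(
    rb"<meta[^>]+charset\s*=\s*[\"']?\s*([A-Za-z0-9_.:-]+)", re.IGNORECASE
)


def _supported(name):
    """Return True if Python has a codec registered under this name."""
    try:
        codecs.lookup(name)
        return True
    except LookupError:
        return False


def charset_from_header(content_type):
    """Extract the charset parameter from a Content-Type header value, if any."""
    if not content_type:
        return None
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def pick_encoding(body, content_type=None, default="utf-8"):
    """Pick an encoding: BOM first, then HTTP header, then meta tag, then default."""
    for bom, name in _BOMS:
        if body.startswith(bom):
            return name
    header = charset_from_header(content_type)
    if header and _supported(header):
        return header
    match = _META_RE.search(body[:1024])  # scan window size is an assumption
    if match:
        name = match.group(1).decode("ascii").lower()
        if _supported(name):
            return name
    return default


# The situation from the question: the header says UTF-8, the meta tag says ISO-8859-2.
page = b'<html><head><meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-2"></head>'
print(pick_encoding(page, "text/html; charset=UTF-8"))
```

With this ordering, the conflicting page from the question is decoded using the header's UTF-8, and the meta tag is consulted only when neither a byte order mark nor a header charset is present.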
