How can I know the character set of HTML content by HTTP headers?


Question

I know that the charset= parameter of the HTTP Content-Type header can be used to determine the character set of the HTML content. But if that parameter is missing from the Content-Type header, how can I know the character set of the HTML content?

I also know there is a tag such as

<meta charset="utf-8">

in HTML that is used to specify the character set. But we get that tag only after parsing the HTML, and parsing the HTML requires knowing the character set first.

Solution

In the absence of an explicit charset attribute in the Content-Type header, different media types sent over different transports have different default character sets.

For instance, just to show a few definitions:

RFC 2046, Section 4.1.2 of the MIME specification says:

Unlike some other parameter values, the values of the charset parameter are NOT case sensitive. The default character set, which must be assumed in the absence of a charset parameter, is US-ASCII.

RFC 2616, Section 3.7.1 of the HTTP protocol specification says:

The "charset" parameter is used with some media types to define the character set (section 3.4) of the data. When no explicit charset parameter is provided by the sender, media subtypes of the "text" type are defined to have a default charset value of "ISO-8859-1" when received via HTTP. Data in character sets other than "ISO-8859-1" or its subsets MUST be labeled with an appropriate charset value. See section 3.4.1 for compatibility problems.

Which was later reversed by RFC 7231, Appendix B:

The default charset of ISO-8859-1 for text media types has been removed; the default is now whatever the media type definition says. Likewise, special treatment of ISO-8859-1 has been removed from the Accept-Charset header field. (Section 3.1.1.3 and Section 5.3.3).

RFC 3023, Sections 3.1, 3.3, 3.6, and 8.5 of the XML Media Types spec say:

Conformant with [RFC2046], if a text/xml entity is received with the charset parameter omitted, MIME processors and XML processors MUST use the default charset value of "us-ascii"[ASCII]. In cases where the XML MIME entity is transmitted via HTTP, the default charset value is still "us-ascii". (Note: There is an inconsistency between this specification and HTTP/1.1, which uses ISO-8859-1[ISO8859] as the default for a historical reason. Since XML is a new format, a new default should be chosen for better I18N. US-ASCII was chosen, since it is the intersection of UTF-8 and ISO-8859-1 and since it is already used by MIME.)

The charset parameter of text/xml-external-parsed-entity is handled the same as that of text/xml as described in Section 3.1.

The following list applies to text/xml, text/xml-external-parsed-entity, and XML-based media types under the top-level type "text" that define the charset parameter according to this specification:

...

  • If the charset parameter is not specified, the default is "us-ascii". The default of "iso-8859-1" in HTTP is explicitly overridden.

This example shows text/xml with the charset parameter omitted. In this case, MIME and XML processors MUST assume the charset is "us-ascii", the default charset value for text media types specified in [RFC2046]. The default of "us-ascii" holds even if the text/xml entity is transported using HTTP.

Omitting the charset parameter is NOT RECOMMENDED for text/xml. For example, even if the contents of the XML MIME entity are UTF-16 or UTF-8, or the XML MIME entity has an explicit encoding declaration, XML and MIME processors MUST assume the charset is "us-ascii".

RFC 7159, Sections 8.1 and 11 of the JSON specification say:

JSON text SHALL be encoded in UTF-8, UTF-16, or UTF-32. The default encoding is UTF-8, and JSON texts that are encoded in UTF-8 are interoperable in the sense that they will be read successfully by the maximum number of implementations; there are many implementations that cannot successfully read texts in other encodings (such as UTF-16 and UTF-32).

Implementations MUST NOT add a byte order mark to the beginning of a JSON text. In the interests of interoperability, implementations that parse JSON texts MAY ignore the presence of a byte order mark rather than treating it as an error.

Note: No "charset" parameter is defined for this registration. Adding one really has no effect on compliant recipients.

So, in general, if you want to know the charset used by a given resource, and that charset is not expressed through external means, like the charset attribute of a Content-Type header, then you have to determine what type of data you are dealing with, and then determine its charset according to the rules laid out in that data type's specification.
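As a rough illustration of that two-step lookup, here is a minimal Python sketch. The function name `charset_from_content_type` and the `DEFAULT_CHARSETS` table are hypothetical; the table only encodes the RFC 3023 defaults quoted above, and every other media type is treated as "inspect the content itself".

```python
from email.message import Message

# Hypothetical fallback table based on the RFC 3023 passages quoted above.
# Media types that are not listed have no default here and must be sniffed.
DEFAULT_CHARSETS = {
    "text/xml": "us-ascii",
    "text/xml-external-parsed-entity": "us-ascii",
}

def charset_from_content_type(header_value):
    """Return (media_type, charset); charset is None when it must be sniffed."""
    msg = Message()
    msg["Content-Type"] = header_value
    media_type = msg.get_content_type()    # lowercased "type/subtype"
    charset = msg.get_content_charset()    # lowercased value, or None if absent
    if charset is None:
        # No explicit parameter: fall back to the media type's own default.
        charset = DEFAULT_CHARSETS.get(media_type)
    return media_type, charset

print(charset_from_content_type("text/html; charset=UTF-8"))
print(charset_from_content_type("text/xml"))
print(charset_from_content_type("text/html"))
```

The standard library's `email.message.Message` is used only for its header parameter parsing, which already treats the parameter name and the charset value case-insensitively.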

In your case, you are dealing with HTML over HTTP, so the RFC 2616 rule applies to you. The HTML 5 spec, Section 8.2.2.2, defines a very detailed algorithm for determining the HTML's charset when no charset attribute is specified in the Content-Type header. That algorithm first checks for the presence of a UTF BOM and, if none is present, assumes the HTML is 8-bit and parses it for any <meta> tags that contain character set or language declarations.
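A heavily simplified Python sketch of that idea follows. It is not the HTML specification's algorithm: the function name `sniff_html_charset`, the regular expression, the 1024-byte scan window, and the `windows-1252` fallback are all assumptions chosen for illustration, and a real parser handles many more cases (comments, attribute edge cases, label aliases).

```python
import re

# Byte order marks are checked first, as described above.
BOMS = (
    (b"\xef\xbb\xbf", "utf-8"),
    (b"\xfe\xff", "utf-16-be"),
    (b"\xff\xfe", "utf-16-le"),
)

# Simplified pattern for charset=... inside a <meta> tag. It covers both
# <meta charset="utf-8"> and the http-equiv="Content-Type" form.
META_CHARSET = re.compile(
    rb"<meta[^>]*?charset\s*=\s*[\"']?\s*([A-Za-z0-9._:\-]+)", re.IGNORECASE
)

def sniff_html_charset(data, fallback="windows-1252"):
    """Guess the charset of raw HTML bytes; the fallback is an assumption."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    # Treat the document as 8-bit text and scan only its beginning.
    match = META_CHARSET.search(data[:1024])
    if match:
        return match.group(1).decode("ascii").lower()
    return fallback

print(sniff_html_charset(b'<html><head><meta charset="UTF-8"></head>'))
```

Scanning the raw bytes works because the characters that make up a `<meta>` tag are encoded identically in ASCII-compatible encodings, which is what resolves the chicken-and-egg problem raised in the question.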

The XML 1.0 specification, Appendix F, also defines an algorithm that makes it easy to determine the character set used by the XML prolog, so you can read its encoding attribute, if present, to determine the character set of the remaining XML.
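The same approach can be sketched for XML. This is a reduced version of the Appendix F idea, not a full implementation: the function name `sniff_xml_charset` is hypothetical, EBCDIC and BOM-less UTF-32 are not handled, and the UTF-8 result when nothing is declared reflects XML's own default rather than the text/xml media type rules quoted earlier.

```python
import re

# BOMs that identify the encoding outright; 4-byte marks are tested first
# because the UTF-32 little-endian BOM begins with the UTF-16 one.
BOMS = (
    (b"\x00\x00\xfe\xff", "utf-32-be"),
    (b"\xff\xfe\x00\x00", "utf-32-le"),
    (b"\xfe\xff", "utf-16-be"),
    (b"\xff\xfe", "utf-16-le"),
)

# The encoding pseudo-attribute of an XML declaration in an 8-bit encoding.
XML_DECL_ENCODING = re.compile(
    rb"^<\?xml[^>]*?encoding\s*=\s*[\"']([A-Za-z][A-Za-z0-9._\-]*)[\"']"
)

def sniff_xml_charset(data):
    """Guess the charset of raw XML bytes from its first few bytes."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    # Without a BOM, the byte pattern of "<?" reveals 16-bit encodings.
    if data.startswith(b"\x00<\x00?"):
        return "utf-16-be"
    if data.startswith(b"<\x00?\x00"):
        return "utf-16-le"
    if data.startswith(b"\xef\xbb\xbf"):
        data = data[3:]  # skip the UTF-8 BOM before reading the declaration
    match = XML_DECL_ENCODING.match(data)
    if match:
        return match.group(1).decode("ascii").lower()
    return "utf-8"  # XML's default when no declaration is present

print(sniff_xml_charset(b'<?xml version="1.0" encoding="ISO-8859-1"?><a/>'))
```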
