How to know the charset of HTML content from HTTP headers?


Question

I know that the charset= parameter of the Content-Type HTTP header can be used to determine the charset of the HTML content. But if that parameter is missing from the Content-Type header, how can I determine the charset of the HTML content? I also know that HTML can contain a tag such as <meta charset="utf-8"> to specify the charset. But we only get that tag after parsing the HTML, and parsing the HTML requires knowing the charset first.

Recommended answer

In the absence of an explicit charset attribute in the Content-Type header, different media types sent over different transports have different default charsets.

For instance, just to show a few definitions:

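RFC 2046 says:

Unlike some other parameter values, the values of the charset parameter are NOT case sensitive. The default character set, which must be assumed in the absence of a charset parameter, is US-ASCII.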

RFC 2616 says:

The "charset" parameter is used with some media types to define the character set (section 3.4) of the data. When no explicit charset parameter is provided by the sender, media subtypes of the "text" type are defined to have a default charset value of "ISO-8859-1" when received via HTTP. Data in character sets other than "ISO-8859-1" or its subsets MUST be labeled with an appropriate charset value. See section 3.4.1 for compatibility problems.

This default was later changed by RFC 7231:

The default charset of ISO-8859-1 for text media types has been removed; the default is now whatever the media type definition says. Likewise, special treatment of ISO-8859-1 has been removed from the Accept-Charset header field. (Section 3.1.1.3 and Section 5.3.3).

RFC 3023, Sections 3.6 and 8.5, say:

Conformant with [RFC2046], if a text/xml entity is received with the charset parameter omitted, MIME processors and XML processors MUST use the default charset value of "us-ascii"[ASCII]. In cases where the XML MIME entity is transmitted via HTTP, the default charset value is still "us-ascii". (Note: There is an inconsistency between this specification and HTTP/1.1, which uses ISO-8859-1[ISO8859] as the default for a historical reason. Since XML is a new format, a new default should be chosen for better I18N. US-ASCII was chosen, since it is the intersection of UTF-8 and ISO-8859-1 and since it is already used by MIME.)

According to the same RFC:

The following list applies to text/xml, text/xml-external-parsed-entity, and XML-based media types under the top-level type "text" that define the charset parameter according to this specification:

...

  • If the charset parameter is not specified, the default is "us-ascii". The default of "iso-8859-1" in HTTP is explicitly overridden.

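This example shows text/xml with the charset parameter omitted. In this case, MIME and XML processors MUST assume the charset is "us-ascii", the default charset value for text media types specified in [RFC2046]. The default of "us-ascii" holds even if the text/xml entity is transported using HTTP.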

Omitting the charset parameter is NOT RECOMMENDED for text/xml. For example, even if the contents of the XML MIME entity are UTF-16 or UTF-8, or the XML MIME entity has an explicit encoding declaration, XML and MIME processors MUST assume the charset is "us-ascii".

RFC 7159 says:

JSON text SHALL be encoded in UTF-8, UTF-16, or UTF-32. The default encoding is UTF-8, and JSON texts that are encoded in UTF-8 are interoperable in the sense that they will be read successfully by the maximum number of implementations; there are many implementations that cannot successfully read texts in other encodings (such as UTF-16 and UTF-32).

Implementations MUST NOT add a byte order mark to the beginning of a JSON text. In the interests of interoperability, implementations that parse JSON texts MAY ignore the presence of a byte order mark rather than treating it as an error.

Note: No "charset" parameter is defined for this registration. Adding one really has no effect on compliant recipients.
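As a rough illustration of the JSON rules quoted above, a receiver could decode the body as UTF-8 and tolerate a leading byte order mark. This is only a sketch: the function name load_json_bytes is hypothetical, and it assumes the sender used the default UTF-8 encoding (UTF-16 and UTF-32 input is not handled here).

```python
import json


def load_json_bytes(data: bytes):
    """Decode a JSON body, assuming the default UTF-8 encoding.

    The "utf-8-sig" codec decodes UTF-8 and silently drops a leading
    BOM if one is present, which matches the "MAY ignore" allowance.
    """
    text = data.decode("utf-8-sig")
    return json.loads(text)
```

Any charset parameter on the Content-Type header is deliberately not consulted here, since the registration quoted above defines none.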

So, in general, if you want to know the charset used by a given resource, and that charset is not expressed through external means, like the charset attribute of a Content-Type header, then you have to determine what type of data you are dealing with, and then determine its charset according to the rules in that data type's specification.
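A minimal sketch of that decision order, in Python: use the explicit charset if the header carries one, otherwise fall back to a per-media-type default, otherwise inspect the content itself. The function name resolve_charset and the MEDIA_TYPE_DEFAULTS table are hypothetical; the table lists only the text/xml defaults quoted above and is not a complete registry.

```python
from email.message import Message

# Illustrative defaults taken from the RFC 3023 text quoted above.
MEDIA_TYPE_DEFAULTS = {
    "text/xml": "us-ascii",
    "text/xml-external-parsed-entity": "us-ascii",
}


def resolve_charset(content_type_header: str):
    """Return (charset, source) for a Content-Type header value."""
    # The stdlib email parser understands MIME parameter syntax,
    # including quoted values such as charset="utf-8".
    msg = Message()
    msg["Content-Type"] = content_type_header
    media_type = msg.get_content_type()   # lower-cased "type/subtype"
    charset = msg.get_content_charset()   # lower-cased, or None if absent

    if charset is not None:
        return charset, "header"
    if media_type in MEDIA_TYPE_DEFAULTS:
        return MEDIA_TYPE_DEFAULTS[media_type], "media-type default"
    # No explicit value and no listed default: the content has to be examined.
    return None, "inspect content"
```

For example, resolve_charset("text/html") yields (None, "inspect content"), which is the situation described in the question.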

In your case, you are dealing with HTML over HTTP, so the RFC 2616 rule applies to you. The HTML 5 spec, Section 8.2.2.2, defines a very detailed algorithm for determining the HTML's charset when no charset attribute is specified in the Content-Type header. That algorithm first checks for the presence of a UTF BOM; if none is present, it assumes the HTML is 8-bit and parses it for any <meta> tags that contain charset or language declarations.
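The two steps just described (BOM check, then a scan for <meta>) can be approximated as follows. This is a heavily simplified sketch, not the algorithm from the HTML spec: the function name sniff_html_charset, the regular expression, and the 1024-byte prescan limit are all assumptions made for illustration, and a real prescan handles comments, attribute parsing, and many edge cases that this ignores.

```python
import codecs
import re

# Byte order marks checked before looking at any markup.
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]

# Matches both <meta charset="..."> and
# <meta http-equiv="Content-Type" content="text/html; charset=...">.
META_CHARSET = re.compile(
    rb"<meta[^>]*?charset\s*=\s*[\"']?\s*([A-Za-z0-9_.:-]+)",
    re.IGNORECASE,
)


def sniff_html_charset(data: bytes, prescan_limit: int = 1024):
    """Guess an HTML document's charset from its first bytes, or return None."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    # Without a BOM, treat the bytes as 8-bit text and scan the markup
    # directly; charset names are ASCII, so no decoding is needed yet.
    match = META_CHARSET.search(data[:prescan_limit])
    if match:
        return match.group(1).decode("ascii").lower()
    return None
```

Scanning raw bytes is what resolves the chicken-and-egg problem from the question: the <meta> tag can be found before the document's charset is known because the tag itself is written in ASCII-compatible bytes.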

The XML 1.0 spec, Appendix F, also defines an algorithm that makes it easy to determine the charset used by the XML prolog so you can read its encoding attribute, if present, to determine the charset of the remaining XML.
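A simplified sketch in the spirit of that approach is shown below. It is not a transcription of Appendix F: the function name sniff_xml_encoding is hypothetical, only a handful of byte patterns are recognized, and the final fallback to UTF-8 reflects the XML default for documents with neither a BOM nor an encoding declaration.

```python
import codecs
import re

# Encoding pseudo-attribute inside an ASCII-compatible XML declaration.
XML_DECL_ENCODING = re.compile(
    rb"""^<\?xml[^>]*?encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']"""
)


def sniff_xml_encoding(data: bytes) -> str:
    """Guess the encoding of an XML entity from its first bytes."""
    # UTF-32 BOMs are checked first because the UTF-32 LE BOM
    # begins with the same two bytes as the UTF-16 LE BOM.
    if data.startswith((codecs.BOM_UTF32_BE, codecs.BOM_UTF32_LE)):
        return "utf-32"
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if data.startswith((codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)):
        return "utf-16"
    # No BOM: "<?" encoded as two-byte code units reveals UTF-16.
    if data.startswith(b"\x00<\x00?"):
        return "utf-16-be"
    if data.startswith(b"<\x00?\x00"):
        return "utf-16-le"
    # ASCII-compatible prolog: read the declared encoding, if any.
    match = XML_DECL_ENCODING.match(data)
    if match:
        return match.group(1).decode("ascii").lower()
    return "utf-8"
```

Note that this inspects the content only; for text/xml delivered without a charset parameter, the RFC 3023 text quoted earlier requires "us-ascii" regardless of what the declaration says.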
