How do browsers determine the encoding used?


Question

I understand there are two ways to set the encoding:

  1. By using the Content-Type header.
  2. By using a meta tag in the HTML.

The Content-Type header is not mandatory and has to be set explicitly (the server side can set it if it wants to), and the meta tag is optional as well.

If neither of these is present, how does the browser determine the encoding used for parsing the content?
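To make the "neither is present" case concrete, here is a minimal Python sketch of looking for the two declarations. The function name `declared_encoding` and the regular expression are illustrative assumptions, not how any real browser is implemented; actual browsers follow a far more detailed procedure (for example, they also check for a byte order mark). The sketch gives the HTTP header priority over the meta tag.

```python
import re
from email.message import Message

# Hypothetical, simplified pattern for <meta charset="..."> and
# <meta http-equiv="Content-Type" content="...; charset=...">.
META_CHARSET = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9_\-]+)', re.IGNORECASE
)

def declared_encoding(content_type, body):
    """Return the declared charset, or None if neither declaration exists."""
    if content_type:
        msg = Message()
        msg["Content-Type"] = content_type
        charset = msg.get_content_charset()  # lowercased, None if absent
        if charset:
            return charset
    # Only scan the beginning of the document, as a rough approximation.
    match = META_CHARSET.search(body[:1024])
    if match:
        return match.group(1).decode("ascii").lower()
    return None  # nothing declared: the caller would have to guess

print(declared_encoding("text/html; charset=UTF-8", b""))
print(declared_encoding("text/html", b'<meta charset="ISO-8859-1">'))
print(declared_encoding(None, b"<p>hi</p>"))
```

When the function returns None, no encoding was declared, which is exactly the situation the question asks about.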

Recommended answer

They can guess it based on heuristics.

I don't know how good browsers are at encoding detection today, but MS Word did a very good job of it and recognized even charsets I had never heard of. You can just open a *.txt file with a random encoding and see.

This algorithm usually involves statistical analysis of byte patterns, like frequency distribution of trigraphs of various languages encoded in each code page that will be detected; such statistical analysis can also be used to perform language detection.

https://en.wikipedia.org/wiki/Charset_detection
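As a rough illustration of the statistical idea, the following toy Python sketch decodes the same bytes with several candidate encodings and keeps the one whose result contains the largest share of common Russian letters. Everything here is an assumption made for the example: the candidate list, the letter set, and the names `score` and `guess_encoding` are hypothetical, and it uses single-letter frequencies rather than trigraphs. It is not the algorithm used by any real detector.

```python
# Ten of the most frequent letters in Russian text (illustrative choice).
COMMON_RUSSIAN = set("оеаинтсрвл")

def score(text):
    """Fraction of characters that are common Russian letters."""
    if not text:
        return 0.0
    hits = sum(1 for ch in text if ch in COMMON_RUSSIAN)
    return hits / len(text)

def guess_encoding(data, candidates=("utf-8", "windows-1251", "koi8-r")):
    """Return the candidate whose decoding looks most like Russian text."""
    best_name, best_score = None, -1.0
    for name in candidates:
        try:
            text = data.decode(name)  # strict: invalid bytes rule it out
        except UnicodeDecodeError:
            continue
        s = score(text)
        if s > best_score:
            best_name, best_score = name, s
    return best_name

sample = "Привет, мир! Это проверка кодировки."
for enc in ("utf-8", "windows-1251", "koi8-r"):
    print(enc, "->", guess_encoding(sample.encode(enc)))
```

The sketch works on this sample because bytes that are lowercase letters in windows-1251 mostly decode to uppercase letters in koi8-r and vice versa, so the wrong decoding scores poorly. On short or atypical input the same approach can easily guess wrong.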

Firefox uses the Mozilla Charset Detectors. Mozilla documents how they work, and you can also change the detector's heuristic preferences.

Chrome previously used the ICU detector but switched to CED (Compact Encoding Detection) almost 2 years ago.

None of the detection algorithms is perfect. They can guess incorrectly, because it is just guessing anyway!

This process is not foolproof because it depends on statistical data.

That is how the famous "Bush hid the facts" bug occurred. Bad guessing also introduces a vulnerability into the system.
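The effect behind that bug can be reproduced in a few lines of Python: the 18 ASCII bytes of the sentence happen to form 9 valid UTF-16LE code units, so a detector that guesses UTF-16LE displays CJK characters instead of English. This sketch only shows the misinterpretation itself; it does not reimplement the detection heuristic that made the wrong guess.

```python
raw = "Bush hid the facts".encode("ascii")  # 18 bytes of plain ASCII

# Wrong guess: treat every pair of bytes as one UTF-16LE code unit.
misread = raw.decode("utf-16-le")

print(len(raw), "bytes ->", len(misread), "characters")
print(misread)

# Every resulting character falls in the CJK Unified Ideographs block.
print(all(0x4E00 <= ord(ch) <= 0x9FFF for ch in misread))
```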

For all those skeptics out there, there is a very good reason why the character encoding should be explicitly stated. When the browser isn't told what the character encoding of a text is, it has to guess: and sometimes the guess is wrong. Hackers can manipulate this guess in order to slip XSS past filters and then fool the browser into executing it as active code. A great example of this is the Google UTF-7 exploit.

http://htmlpurifier.org/docs/enduser-utf8.html#fixcharset-none
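The general shape of a UTF-7 attack can be sketched in Python. The filter `naive_filter_allows` below is a hypothetical stand-in for a sanitizer that only looks for literal angle brackets, and the payload is a generic illustration rather than the one used in the exploit mentioned above.

```python
# A payload written in UTF-7: it contains no literal "<" or ">" bytes.
payload = b"+ADw-script+AD4-alert(1)+ADw-/script+AD4-"

def naive_filter_allows(data):
    """Hypothetical filter that only rejects literal angle brackets."""
    return b"<" not in data and b">" not in data

print(naive_filter_allows(payload))  # the filter sees nothing suspicious

# If the consumer then guesses UTF-7, the markup reappears.
decoded = payload.decode("utf-7")
print(decoded)
```

The filter and the consumer disagree about the encoding, and that mismatch is what an explicit declaration is meant to prevent.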

As a result, the encoding should always be explicitly stated.
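A minimal sketch of stating the encoding both ways, using Python's standard `http.server` module. The page content, the `Handler` class, the `serve` helper, and the port are assumptions made for this example; any server stack can send the same header and markup.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Declaration 2: a meta tag near the top of the document.
PAGE = """<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>Explicit encoding</title>
</head>
<body><p>Привет, мир</p></body>
</html>
"""
BODY = PAGE.encode("utf-8")  # the bytes must really be in the declared encoding

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        # Declaration 1: the charset parameter of the Content-Type header.
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(BODY)))
        self.end_headers()
        self.wfile.write(BODY)

def serve(port=8000):
    """Serve the page on localhost until interrupted."""
    HTTPServer(("127.0.0.1", port), Handler).serve_forever()
```

Calling `serve()` and opening http://127.0.0.1:8000/ should show the Cyrillic text correctly, since the browser is told the encoding and does not need to guess.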
