UTF-8 HTML and CSS files with BOM (and how to remove the BOM with Python)
Problem description
First, some background: I'm developing a web application using Python. All of my (text) files are currently stored in UTF-8 with the BOM. This includes all my HTML templates and CSS files. These resources are stored as binary data (BOM and all) in my DB.
When I retrieve the templates from the DB, I decode them using template.decode('utf-8'). When the HTML arrives in the browser, the BOM is present at the beginning of the HTTP response body. This generates a very interesting error in Chrome:
Extra <html> encountered. Migrating attributes back to the original <html> element and ignoring the tag.
Chrome seems to generate an <html> tag automatically when it sees the BOM and mistakes it for content, which makes the real <html> tag an error.
So, using Python, what is the best way to remove the BOM from my UTF-8 encoded templates (if it exists -- I can't guarantee this in the future)?
For other text-based files like CSS, will major browsers correctly interpret (or ignore) the BOM? They are being sent as plain binary data without .decode('utf-8').
Note: I am using Python 2.5.
Thanks!
Since you state:
"All of my (text) files are currently stored in UTF-8 with the BOM"
then use the 'utf-8-sig' codec to decode them:
>>> s = u'Hello, world!'.encode('utf-8-sig')
>>> s
'\xef\xbb\xbfHello, world!'
>>> s.decode('utf-8-sig')
u'Hello, world!'
It automatically removes the expected BOM, and works correctly if the BOM is not present as well.
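For the files that are sent as raw bytes without decoding (the CSS case in the question), the same idea can be applied at the byte level. The following is only a rough sketch, not something taken from the question's code base: the helper names strip_utf8_bom and decode_template and the sample data are made up for illustration. It relies on codecs.BOM_UTF8 from the standard library, and the sample byte strings are built with .encode('utf-8') so that the snippet does not depend on byte-literal syntax.

```python
import codecs


def strip_utf8_bom(data):
    """Return the byte string without a leading UTF-8 BOM, if one is present.

    Hypothetical helper for resources served as raw bytes (e.g. CSS).
    """
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


def decode_template(data):
    """Decode UTF-8 bytes to text, dropping a leading BOM if there is one.

    Hypothetical helper; it simply wraps the 'utf-8-sig' codec shown above.
    """
    return data.decode('utf-8-sig')


# Sample data invented for this sketch.
css_with_bom = codecs.BOM_UTF8 + 'body { margin: 0; }'.encode('utf-8')
html_with_bom = codecs.BOM_UTF8 + '<html></html>'.encode('utf-8')

# Raw bytes: the three BOM bytes are cut off, the rest is untouched.
clean_css = strip_utf8_bom(css_with_bom)

# Text: 'utf-8-sig' drops the BOM, whereas plain 'utf-8' would keep it
# as the character U+FEFF at the start of the decoded string.
clean_html = decode_template(html_with_bom)
```

Both helpers leave input without a BOM unchanged, so they should be safe to apply even if the files are later saved without one.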