Unwanted replacement of html entities by BeautifulSoup
Problem description
I have some HTML containing MML (MathML) that I am generating from Word documents using MathType. I have a Python script that uses BeautifulSoup to prettify it, but the problem is that it takes something like &#x2220; and turns it into the actual byte sequence 0xE2 0x88 0xA0, which is the UTF-8 encoding of the ∠ symbol. This is a problem because 0xE2 0x88 0xA0 won't display as ∠ in the browser. Instead, the browser interprets it as a series of Latin characters. This is happening with all the math entities as well, such as &Delta; &ang; &minus; &plus; etc.
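To make the symptom concrete, here is a small standard-library sketch of what those three bytes look like under two different decodings. It assumes the browser is falling back to Windows-1252 because the page does not declare UTF-8; the question does not confirm which fallback encoding is actually in play.

```python
# The angle sign U+2220 encoded as UTF-8 is exactly the three bytes
# mentioned in the question.
raw = "\u2220".encode("utf-8")
assert raw == b"\xe2\x88\xa0"

# Decoded as UTF-8, the bytes give the intended symbol back.
as_utf8 = raw.decode("utf-8")

# Decoded as Windows-1252 (ASSUMED browser fallback), each byte becomes
# its own character, which is the "series of Latin characters" effect.
as_cp1252 = raw.decode("cp1252")

# A pure-ASCII alternative: a numeric character reference, which renders
# correctly whatever encoding the browser guesses.
as_charref = "\u2220".encode("ascii", "xmlcharrefreplace")

print(repr(as_utf8), repr(as_cp1252), as_charref)
```

If the fallback assumption holds, declaring the page as UTF-8 (for example with a meta charset tag) would also make the raw bytes display correctly.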
I looked through the BeautifulSoup documentation and I can see how to turn entities into the byte sequences, but I'm not using that command; all I'm using is prettify(). And I didn't see a way in the BeautifulSoup documentation to not turn entities into byte sequences.
Does anyone know if there's a setting in BeautifulSoup to tell it not to change entities to byte sequences? I hope so because it seems kind of dumb to have to undo the damage after prettify runs :)
Thanks in advance for your help!
Recommended answer
I missed part of the BeautifulSoup documentation. Entities are converted to Unicode characters when the document is parsed, and the default output formatter ("minimal") leaves them as Unicode characters instead of turning them back into entities. So this behaviour can be changed by using a different output formatter. (D'oh)
"You can change this behavior by providing a value for the formatter argument to prettify(), encode(), or decode()...."
So if I pass in formatter="html", Beautiful Soup will convert Unicode characters to HTML entities whenever possible! Yay! Thank you Beautiful Soup!
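As a minimal sketch of that fix (it assumes the bs4 package is installed and uses the built-in "html.parser" backend; the sample markup and variable names are made up for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical sample input containing a numeric and a named entity.
html = "<p>&#x2220;ABC &Delta;x</p>"
soup = BeautifulSoup(html, "html.parser")

# Default formatter ("minimal"): the parsed Unicode characters are
# written out as-is, so the angle sign appears as a raw character.
default_out = soup.decode()

# formatter="html": Unicode characters are converted back to HTML
# entities where Beautiful Soup knows a named entity for them.
entity_out = soup.decode(formatter="html")

# prettify() accepts the same formatter argument.
pretty_out = soup.prettify(formatter="html")

print(default_out)
print(entity_out)
print(pretty_out)
```

The exact entity names in the output are chosen by Beautiful Soup. Characters that have no named entity stay as Unicode characters, so it is still worth serving the page with an explicit UTF-8 declaration.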
(And they have such great documentation. Pity I didn't read the whole thing sooner. :$)