Unwanted replacement of HTML entities by BeautifulSoup

Question

I have some HTML containing MathML (MML) that I generate from Word documents using MathType. I have a Python script that uses BeautifulSoup to prettify it, but the problem is that it takes something like &#x2220; and turns it into the actual byte sequence 0xE2 0x88 0xA0, which is the UTF-8 encoding of the ∠ symbol. This is a problem because 0xE2 0x88 0xA0 won't display as ∠ in the browser. Instead, the browser interprets it as a series of Latin characters. The same thing happens with all the other math entities, such as &Delta; &ang; &minus; &plus; and so on.
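The symptom described above can be reproduced with the standard library alone. The sketch below is only an illustration, and it assumes the browser falls back to Windows-1252 (cp1252) when decoding the page; the question does not say which charset was actually used.

```python
# The angle sign U+2220, which the entity &#x2220; refers to.
angle = "\u2220"

# Serialising it as UTF-8 yields the three bytes quoted in the question.
utf8_bytes = angle.encode("utf-8")
print(" ".join(f"{b:02x}" for b in utf8_bytes))  # e2 88 a0

# Assumption: the page is decoded as Windows-1252 instead of UTF-8.
# Each byte then becomes its own Latin-looking character.
misread = utf8_bytes.decode("cp1252")
print(ascii(misread))  # '\xe2\u02c6\xa0'

# Decoding the same bytes as UTF-8 recovers the original symbol.
print(utf8_bytes.decode("utf-8") == angle)  # True
```

The bytes themselves are valid; they only turn into stray Latin characters when they are read with a charset other than UTF-8.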

I looked through the BeautifulSoup documentation and I can see how to turn entities into the byte sequences, but I'm not using that command; all I'm using is prettify(). And I didn't see a way in the BeautifulSoup documentation to not turn entities into byte sequences.

Does anyone know if there's a setting in BeautifulSoup to tell it not to change entities to byte sequences? I hope so because it seems kind of dumb to have to undo the damage after prettify runs :)

Thanks in advance for your help!

Recommended answer

I missed part of the BeautifulSoup documentation. The default output formatters do the described behaviour: they turn html entities into the unicode characters. So, this behaviour can be changed by using a different output formatter. (D'oh)

"You can change this behavior by providing a value for the formatter argument to prettify(), encode(), or decode()...."

So if I pass in formatter="html", Beautiful Soup will convert Unicode characters to HTML entities whenever possible! Yay! Thank you Beautiful Soup!
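A minimal sketch of that fix, assuming the bs4 package is installed and using the built-in "html.parser" backend. The sample markup is made up for illustration and is not the MathType output from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical sample input with one numeric and two named entities.
markup = "<p>&#x2220; &Delta; &minus;</p>"
soup = BeautifulSoup(markup, "html.parser")

# Default formatter: entities come back as literal Unicode characters.
default_output = soup.prettify()

# formatter="html": Unicode characters are written as named HTML
# entities wherever a name exists.
entity_output = soup.prettify(formatter="html")

print(ascii(default_output))  # ascii() keeps the output console-safe
print(entity_output)
```

Note that the numeric reference &#x2220; should come back as the named entity &ang; rather than in its original numeric form, because the parser stores only the character and the formatter picks the entity name on output.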

(And they have such great documentation. Pity I didn't read the whole thing sooner. :$)
