Prevent BeautifulSoup's renderContents() from changing &nbsp; to Â


Question

I'm using bs4 to process some text, but in some cases it converts &nbsp; (non-breaking space) characters to Â. As best I can tell, this is an encoding mismatch between UTF-8 and latin1 (in one direction or the other).

Everything in my web app is UTF-8, Python 3 is UTF-8, and I've confirmed that the database is UTF-8.

I've narrowed the problem down to this one line:

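from bs4 import BeautifulSoup

# `text` is the input string; in this test it is a single non-breaking space (U+00A0).
print("Before soup: " + text)  # Before soup: (a non-breaking space)
soup = BeautifulSoup(text, "html.parser")
# ... do stuff to soup, but all commented out for this testing.
soup = BeautifulSoup(soup.renderContents(), "html.parser")  # <---- PROBLEM!
print(soup.renderContents())  # b'\xc3\x82\xc2\xa0'
print("After SOUP: " + str(soup))  # After SOUP: Â (followed by a non-breaking space)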

How do I prevent renderContents() from changing the encoding? There is no documentation on this function!

After further research into the docs, this seems to be the key, but I still can't fix the problem!

print(soup.prettify(formatter="html"))  # &Acirc;&nbsp;

Recommended answer

OK, apparently I hadn't read deeply enough into the docs. The answer is in the Encodings section: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#encodings

The problem is that the snippet passed to BeautifulSoup is so short that its sub-library, Unicode, Dammit, doesn't have enough information to guess the encoding correctly. renderContents() returns bytes (UTF-8 by default), so the second BeautifulSoup(...) call has to guess how to decode them.
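To make the mismatch concrete, here is a minimal sketch in plain Python (no BeautifulSoup involved). It assumes the wrong guess was latin-1, which is consistent with the bytes printed in the question. The variable names are illustrative only.

```python
# A non-breaking space, as in the question's input.
nbsp = "\u00a0"

# What renderContents() hands back by default: the UTF-8 bytes.
utf8_bytes = nbsp.encode("utf-8")
print(utf8_bytes)        # b'\xc2\xa0'

# If those bytes are wrongly decoded as latin-1, each byte becomes
# its own character: 0xC2 is 'Â' and 0xA0 is a non-breaking space.
misread = utf8_bytes.decode("latin-1")
print(repr(misread))     # 'Â\xa0'

# Rendering the misread text back to UTF-8 gives the bytes from the question.
double_encoded = misread.encode("utf-8")
print(double_encoded)    # b'\xc3\x82\xc2\xa0'
```

So the stray Â is the first byte of the UTF-8 encoding of the non-breaking space, read as a character of its own.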

"Unicode, Dammit guesses correctly most of the time, but sometimes it makes mistakes. ...you can avoid mistakes and delays by passing it to the BeautifulSoup constructor as from_encoding."

So the key is to pass from_encoding="UTF-8" each time a BeautifulSoup object is constructed from bytes:

soup = BeautifulSoup(soup.renderContents(), "html.parser", from_encoding="UTF-8")
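As a rough end-to-end sketch of that fix, the snippet below round-trips a small document through bytes and back. The sample markup is made up for illustration, and encode_contents() is used as the bs4 spelling of the legacy renderContents(). It also shows an alternative that is not part of the original answer: decode_contents() returns a str, so the second parse has nothing to guess.

```python
from bs4 import BeautifulSoup

# Hypothetical sample input: a paragraph holding one non-breaking space.
text = "<p>\u00a0</p>"

soup = BeautifulSoup(text, "html.parser")

# Bytes output (UTF-8 by default), like renderContents() in the question.
raw = soup.encode_contents()

# Fix from the answer: state the encoding instead of letting it be guessed.
fixed = BeautifulSoup(raw, "html.parser", from_encoding="utf-8")

# Alternative (an assumption, not from the answer): stay in str the whole time.
no_bytes = BeautifulSoup(soup.decode_contents(), "html.parser")

print(repr(str(fixed)))
print(repr(str(no_bytes)))
```

Whether the unfixed call misbehaves can depend on which character-detection library is installed, so naming the encoding explicitly (or avoiding bytes altogether) is the more predictable option.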
