Prevent BeautifulSoup's renderContents() from changing &nbsp; to Â
Question
I'm using bs4 to do some work on some text, but in some cases it converts &nbsp; characters to Â. As best I can tell, this is an encoding mismatch between UTF-8 and latin1 (or the reverse?).
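The suspected mismatch can be reproduced without BeautifulSoup. The following is a minimal sketch, assuming the wrong codec is latin-1 as guessed above (the variable names are illustrative only). It shows how the UTF-8 bytes of a non-breaking space turn into Â plus a non-breaking space when decoded with the wrong codec, and how re-encoding that result yields the doubled byte sequence:

```python
# The non-breaking space (U+00A0) takes two bytes in UTF-8.
nbsp = "\u00a0"
utf8_bytes = nbsp.encode("utf-8")
print(utf8_bytes)  # b'\xc2\xa0'

# Decoding those bytes as latin-1 (assumed wrong codec) maps each byte
# to its own character: 0xC2 becomes "Â" and 0xA0 stays a non-breaking space.
mojibake = utf8_bytes.decode("latin-1")
print(repr(mojibake))  # 'Â\xa0'

# Encoding the damaged string back to UTF-8 doubles the byte count.
double_encoded = mojibake.encode("utf-8")
print(double_encoded)  # b'\xc3\x82\xc2\xa0'
```

The last value matches the bytes printed by the second renderContents() call in the question, which is consistent with UTF-8 bytes being decoded as a single-byte encoding somewhere along the way.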
Everything in my web app is UTF-8, Python 3 is UTF-8, and I've confirmed that the database is UTF-8.
I've narrowed the problem down to this one line:
from bs4 import BeautifulSoup

print("Before soup: " + text)  # Before soup:
soup = BeautifulSoup(text, "html.parser")
# ... do stuff to soup, but all commented out for this testing.
# renderContents() returns bytes, so the second parse has to guess their encoding.
soup = BeautifulSoup(soup.renderContents(), "html.parser")  # <---- PROBLEM!
print(soup.renderContents())  # b'\xc3\x82\xc2\xa0'
print("After SOUP: " + str(soup))  # After SOUP: Â
How do I prevent renderContents() from changing the encoding? I can't find any documentation on this function!
After further research in the docs, this seems to be the key, but I still can't fix the problem!
print(soup.prettify(formatter="html")) # Â
Recommended answer
OK, apparently I hadn't read deeply enough into the docs. Here's where the answer can be found:
From https://www.crummy.com/software/BeautifulSoup/bs4/doc/#encodings :
The problem is that the snippet passed to BeautifulSoup is so short that BeautifulSoup's sub-library, Unicode, Dammit, doesn't have enough information to guess the encoding correctly.
"Unicode, Dammit guesses correctly most of the time, but sometimes it makes mistakes. ...you can avoid mistakes and delays by passing it to the BeautifulSoup constructor as from_encoding."
So the key is to add from_encoding="UTF-8" each time the BeautifulSoup object is constructed from bytes:
soup = BeautifulSoup(soup.renderContents(), "html.parser", from_encoding="UTF-8")
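For comparison, here is a small sketch of two ways to round-trip the markup without any encoding guesswork. It assumes the input is the literal string "&nbsp;" (the question does not show the actual value of text), and it uses encode_contents() and decode_contents(), which to my understanding are the current bs4 names for what renderContents() does. The second option is an alternative not mentioned in the answer: handing BeautifulSoup a str instead of bytes, so there is nothing for Unicode, Dammit to guess.

```python
from bs4 import BeautifulSoup

text = "&nbsp;"  # assumed input; the original value of `text` is not shown
soup = BeautifulSoup(text, "html.parser")

# Option 1: keep working with bytes, but declare their encoding explicitly.
raw = soup.encode_contents()  # UTF-8 bytes by default
soup_from_bytes = BeautifulSoup(raw, "html.parser", from_encoding="utf-8")

# Option 2: pass a str, which BeautifulSoup does not need to decode at all.
markup = soup.decode_contents()
soup_from_str = BeautifulSoup(markup, "html.parser")

print(repr(str(soup_from_bytes)))
print(repr(str(soup_from_str)))
```

Both results should contain a single non-breaking space rather than Â. Whether the unfixed version actually misdetects the encoding may depend on which character-detection library is installed alongside bs4, so the failing case is left out of this sketch.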