BeautifulSoup4 stripped_strings gives me byte objects?
Problem description
I'm trying to get the text out of a blockquote which looks like this:
<blockquote class="postcontent restore ">
01 Oyasumi
<br></br>
02 DanSin'
<br></br>
03 w.t.s.
<br></br>
04 Lovism
<br></br>
05 NoName
<br></br>
06 Gakkou
<br></br>
07 Happy☆Day
<br></br>
08 Endless End.
</blockquote>
What I'm trying to do in Python 2.7 is this (it can't decode the ☆ character, which is why I tried to use encode):
soup = BeautifulSoup(r.text, "html5lib") #r is from a requests get request
content = soup.find("blockquote", {"class": "postcontent restore "}).stripped_strings
for line in content:
print(line.encode("utf-8"))
And this is what I get:
b'01 Oyasumi'
b"02 DanSin'"
b'03 w.t.s.'
b'04 Lovism'
b'05 NoName'
b'06 Gakkou'
b'07 Happy\xe2\x98\x86Day'
b'08 Endless End.'
What am I doing wrong?
Recommended answer
The problem is that, when from_encoding is not specified, Beautiful Soup guesses the original encoding with a sub-library called Unicode, Dammit and converts the document to Unicode based on that guess. Here the guess is wrong. More info is in the Encodings section of the documentation.
>>> from bs4 import BeautifulSoup
>>> doc = '''<blockquote class="postcontent restore ">
... 01 Oyasumi
... <br></br>
... 02 DanSin'
... <br></br>
... 03 w.t.s.
... <br></br>
... 04 Lovism
... <br></br>
... 05 NoName
... <br></br>
... 06 Gakkou
... <br></br>
... 07 Happy☆Day
... <br></br>
... 08 Endless End.
... </blockquote>'''
>>> soup = BeautifulSoup(doc, 'html5lib')
>>> soup.original_encoding
u'windows-1252'
>>> content = soup.find("blockquote", {"class": "postcontent restore "}).stripped_strings
>>> for line in content:
... print(line)
...
01 Oyasumi
02 DanSin'
03 w.t.s.
04 Lovism
05 NoName
06 Gakkou
07 Happy☆Day
08 Endless End.
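The effect of a wrong encoding guess can be reproduced without Beautiful Soup. The following is a minimal sketch using only the standard library, assuming Python 3 and assuming the page was served as UTF-8 bytes. The variable names are illustrative and not part of the original answer.

```python
# Illustrative sketch: decoding UTF-8 bytes with a wrongly guessed codec.
title = "07 Happy\u2606Day"           # U+2606 is the WHITE STAR character

raw = title.encode("utf-8")            # the bytes a UTF-8 page would contain
wrong = raw.decode("windows-1252")     # decoded with the wrong guess
right = raw.decode("utf-8")            # decoded with the real encoding

print(raw)               # b'07 Happy\xe2\x98\x86Day'
print(wrong == title)    # False: the star became three unrelated characters
print(right == title)    # True
```

The three bytes \xe2\x98\x86 are one character in UTF-8 but three separate characters in windows-1252, which is why the detected encoding matters.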
To fix this you have two options:
- By passing in the correct from_encoding parameter, or by excluding the wrong encoding that Dammit is guessing. One problem is that not all parsers support the exclude_encodings argument. For example, the html5lib tree builder doesn't support exclude_encodings.
>>> soup = BeautifulSoup(doc, 'html5lib', from_encoding='utf-8')
>>> content = soup.find("blockquote", {"class": "postcontent restore "}).stripped_strings
>>> for line in content:
... print(line)
...
01 Oyasumi
02 DanSin'
03 w.t.s.
04 Lovism
05 NoName
06 Gakkou
07 Happy☆Day
08 Endless End.
>>>
- By using the lxml parser:
>>> soup = BeautifulSoup(doc, 'lxml')
>>> soup.original_encoding
'utf-8'
>>> content = soup.find("blockquote", {"class": "postcontent restore "}).stripped_strings
>>> for line in content:
... print(line)
...
01 Oyasumi
02 DanSin'
03 w.t.s.
04 Lovism
05 NoName
06 Gakkou
07 Happy☆Day
08 Endless End.
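On the b'...' output shown in the question: that prefix is how Python 3 displays a bytes object, and str.encode always returns bytes. The sketch below assumes Python 3 and assumes each item yielded by stripped_strings is already a Unicode string. The list is a hypothetical stand-in for those items so the example runs without Beautiful Soup.

```python
# Hypothetical stand-in for the strings yielded by stripped_strings.
lines = ["06 Gakkou", "07 Happy\u2606Day"]

encoded = [line.encode("utf-8") for line in lines]     # bytes objects
decoded = [data.decode("utf-8") for data in encoded]   # back to text

for data in encoded:
    print(data)    # Python 3 prints the bytes repr, e.g. b'06 Gakkou'

print(all(isinstance(data, bytes) for data in encoded))   # True
print(decoded == lines)                                    # True
```

Under these assumptions, printing the strings directly, without calling encode, avoids the bytes representation.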