BeautifulSoup4 stripped_strings gives me byte objects?

Views: 2338 Posted: 2016/8/5 19:00:30 Tags: python python-2.7 unicode encoding beautifulsoup

Question

I'm trying to get the text out of a blockquote that looks like this:

<blockquote class="postcontent restore ">
    01 Oyasumi
    <br></br>
    02 DanSin'
    <br></br>
    03 w.t.s.
    <br></br>
    04 Lovism
    <br></br>
    05 NoName
    <br></br>
    06 Gakkou
    <br></br>
    07 Happy☆Day
    <br></br>
    08 Endless End.
</blockquote>

This is what I'm trying to do in Python 2.7 (it can't decode the ☆ character, which is why I tried to use encode):

soup = BeautifulSoup(r.text, "html5lib") #r is from a requests get request
content = soup.find("blockquote", {"class": "postcontent restore "}).stripped_strings
for line in content:
    print(line.encode("utf-8"))

And this is what I get:

b'01 Oyasumi'
b"02 DanSin'"
b'03 w.t.s.'
b'04 Lovism'
b'05 NoName'
b'06 Gakkou'
b'07 Happy\xe2\x98\x86Day'
b'08 Endless End.'

What am I doing wrong?

Recommended answer

The problem is that when from_encoding is not specified, Beautiful Soup uses a sub-library called Unicode, Dammit to guess the document's original encoding and convert it to Unicode. Here the guess is wrong: the session below reports windows-1252, and the ☆ character comes out garbled. More info is in the Encodings section of the documentation.

>>> from bs4 import BeautifulSoup
>>> doc = '''<blockquote class="postcontent restore ">
...     01 Oyasumi
...     <br></br>
...     02 DanSin'
...     <br></br>
...     03 w.t.s.
...     <br></br>
...     04 Lovism
...     <br></br>
...     05 NoName
...     <br></br>
...     06 Gakkou
...     <br></br>
...     07 Happy☆Day
...     <br></br>
...     08 Endless End.
... </blockquote>'''
>>> soup = BeautifulSoup(doc, 'html5lib')
>>> soup.original_encoding 
u'windows-1252'
>>> content = soup.find("blockquote", {"class": "postcontent restore "}).stripped_strings
>>> for line in content:
...     print(line)
... 
01 Oyasumi
02 DanSin'
03 w.t.s.
04 Lovism
05 NoName
06 Gakkou
07 Happyâ˜†Day
08 Endless End.
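
The garbled line above is what you would expect when UTF-8 bytes are decoded as windows-1252. As a minimal sketch using only the standard library (the variable names are illustrative, and cp1252 is Python's codec name for windows-1252), the effect can be reproduced and reversed without Beautiful Soup:

```python
# -*- coding: utf-8 -*-
# Sketch: reproduce the garbled star by decoding UTF-8 bytes with the wrong codec.
original = u"07 Happy\u2606Day"          # U+2606 is the white star character

utf8_bytes = original.encode("utf-8")    # the star becomes the three bytes E2 98 86
garbled = utf8_bytes.decode("cp1252")    # wrong codec: each byte turns into its own character

# Undoing the wrong decode and decoding as UTF-8 recovers the original text.
repaired = garbled.encode("cp1252").decode("utf-8")

print(repr(utf8_bytes))
print(garbled == u"07 Happy\u00e2\u02dc\u2020Day")   # the three characters shown in the session
print(repaired == original)
```

The three bytes of the star map to three separate windows-1252 characters, which is why a single ☆ shows up as three symbols in the output.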

To fix this you have two options:

By passing in the correct from_encoding parameter, or by excluding the wrong encoding that Unicode, Dammit is guessing. One problem is that not all parsers support the exclude_encodings argument. For example, the html5lib tree builder doesn't support exclude_encodings.

>>> soup = BeautifulSoup(doc, 'html5lib', from_encoding='utf-8')
>>> content = soup.find("blockquote", {"class": "postcontent restore "}).stripped_strings
>>> for line in content:
...     print(line)
... 
01 Oyasumi
02 DanSin'
03 w.t.s.
04 Lovism
05 NoName
06 Gakkou
07 Happy☆Day
08 Endless End.
>>>

By using the lxml parser.

>>> soup = BeautifulSoup(doc, 'lxml')
>>> soup.original_encoding
'utf-8'
>>> content = soup.find("blockquote", {"class": "postcontent restore "}).stripped_strings
>>> for line in content:
...     print(line)
... 
01 Oyasumi
02 DanSin'
03 w.t.s.
04 Lovism
05 NoName
06 Gakkou
07 Happy☆Day
08 Endless End.
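
As for the b'...' lines in the question: stripped_strings yields text strings, and it is the call to encode that turns each one into bytes. The sketch below assumes the question's script was actually run by a Python 3 interpreter, which the b prefix suggests but the source does not confirm; the variable names are illustrative.

```python
# -*- coding: utf-8 -*-
# Sketch: encode() converts text into bytes, and Python 3 prints bytes with a b prefix.
line = u"07 Happy\u2606Day"        # text, like each item yielded by stripped_strings
encoded = line.encode("utf-8")     # the result is a byte string, no longer text

print(type(encoded).__name__)      # 'bytes' on Python 3, 'str' on Python 2
print(repr(encoded))               # the star appears as escaped UTF-8 bytes
print(encoded.decode("utf-8") == line)
```

Under that assumption, printing line directly, as the sessions above do, avoids the byte output once the document has been decoded with the right encoding.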
