Make BeautifulSoup handle line breaks as a browser would


Question

I'm using BeautifulSoup (version '4.3.2' with Python 3.4) to convert HTML documents to text. The problem is that web pages sometimes contain newline characters ("\n") that a browser would not render as a line break, but when BeautifulSoup converts the page to text, it leaves the "\n" in.

Example:

Your browser probably renders the following all on one line (even though the source has a newline character in the middle):

This is a paragraph.

And your browser probably renders the following on multiple lines, even though I'm entering it with no newlines:

This is a paragraph.

This is another paragraph.

But when BeautifulSoup converts the same strings to text, the only line breaks it emits are the literal newline characters, and it always emits them:

from bs4 import BeautifulSoup

doc = "<p>This is a\nparagraph.</p>"
soup = BeautifulSoup(doc, "html.parser")  # name the parser explicitly

soup.text
Out[181]: 'This is a\nparagraph.'

doc = "<p>This is a paragraph.</p><p>This is another paragraph.</p>"
soup = BeautifulSoup(doc, "html.parser")  # name the parser explicitly

soup.text
Out[187]: 'This is a paragraph.This is another paragraph.'

Does anyone know how to make BeautifulSoup extract text in a more beautiful way (or really, just get all the newlines right)? Are there any other simple ways around the problem?

Recommended answer

get_text might be helpful here. Its separator argument is inserted between adjacent pieces of text, so the two paragraphs end up on separate lines:

>>> from bs4 import BeautifulSoup
>>> doc = "<p>This is a paragraph.</p><p>This is another paragraph.</p>"
>>> soup = BeautifulSoup(doc, "html.parser")
>>> soup.get_text(separator="\n")
'This is a paragraph.\nThis is another paragraph.'
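
Note that separator="\n" only addresses the second example: a newline embedded inside a single paragraph still comes through unchanged. The following is a minimal sketch, not part of the original answer, of one way to approximate a browser's behavior: collapse whitespace runs inside text nodes to a single space and break lines only at block-level elements. The helper name browser_like_text and the BLOCK_TAGS set are hypothetical, and the tag list is an assumption that is far from exhaustive (it ignores CSS and <pre>, for example).

```python
import re

from bs4 import BeautifulSoup, NavigableString, Tag

# Assumption: a small, non-exhaustive set of tags treated as line-breaking.
BLOCK_TAGS = {"p", "div", "br", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}


def browser_like_text(html):
    """Roughly mimic browser layout: collapse whitespace, break at blocks."""
    soup = BeautifulSoup(html, "html.parser")
    pieces = []

    def walk(node):
        for child in node.children:
            if type(child) is NavigableString:
                # Exact type check skips comments, doctypes and similar nodes.
                # Whitespace runs (including "\n") become a single space.
                pieces.append(re.sub(r"\s+", " ", str(child)))
            elif isinstance(child, Tag):
                if child.name in ("script", "style"):
                    continue  # never rendered as page text
                is_block = child.name in BLOCK_TAGS
                if is_block:
                    pieces.append("\n")
                walk(child)
                if is_block:
                    pieces.append("\n")

    walk(soup)
    # Trim spaces around each line break and drop empty lines.
    lines = [line.strip() for line in "".join(pieces).split("\n")]
    return "\n".join(line for line in lines if line)


print(browser_like_text("<p>This is a\nparagraph.</p>"))
print(browser_like_text(
    "<p>This is a paragraph.</p><p>This is another paragraph.</p>"
))
```

With the two documents from the question, the first call prints the paragraph on one line and the second prints the two paragraphs on separate lines. Because empty lines are dropped, consecutive <br> tags collapse into a single break; keep them if blank lines matter for your use case.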
