Replace <br> with space in BeautifulSoup output


Question

I am scraping a few links with BeautifulSoup; however, it seems to completely ignore <br> tags.

Here is the relevant portion of the source code of the URL I am scraping:

<h1 class="para-title">A quick brown fox jumps over<br>the lazy dog
<span id="something">&#xe800;</span></h1>

Here is my BeautifulSoup code (relevant part only) to get the text within the h1 tag:

    soup = BeautifulSoup(page, 'html.parser')
    title_box = soup.find('h1', attrs={'class': 'para-title'})
    title = title_box.text.strip()
    print(title)

This gives the following output:

    A quick brown fox jumps overthe lazy dog

Whereas I want:

    A quick brown fox jumps over the lazy dog

How can I replace <br> with a space in my code?

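Recommended answer

How about using .get_text() with the separator parameter? It joins the text of every node inside the tag with the given string, so the text on either side of <br> is no longer run together.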

from bs4 import BeautifulSoup

page = '''<h1 class="para-title">A quick brown fox jumps over<br>the lazy dog
<span>some stuff here</span></h1>'''


soup = BeautifulSoup(page, 'html.parser')
title_box = soup.find('h1', attrs={'class': 'para-title'})
title = title_box.get_text(separator=" ").strip()
print(title)

Output:

A quick brown fox jumps over the lazy dog
 some stuff here
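
Note that the output above still contains the newline from the source HTML, and that the separator is inserted at every tag boundary, not only at <br>. The sketch below shows two possible variations, not the original answer's code: one collapses runs of whitespace after get_text(separator=" "), the other replaces only the <br> tags with a space before extracting the text. The helper names title_with_separator and title_replacing_br are made up for illustration.

```python
from bs4 import BeautifulSoup

page = '''<h1 class="para-title">A quick brown fox jumps over<br>the lazy dog
<span>some stuff here</span></h1>'''


def title_with_separator(html):
    """Join every text fragment with a space, then collapse whitespace runs."""
    soup = BeautifulSoup(html, 'html.parser')
    title_box = soup.find('h1', attrs={'class': 'para-title'})
    # split() with no arguments splits on any whitespace, including newlines
    return " ".join(title_box.get_text(separator=" ").split())


def title_replacing_br(html):
    """Swap only the <br> tags for a space; other tag boundaries add nothing."""
    soup = BeautifulSoup(html, 'html.parser')
    title_box = soup.find('h1', attrs={'class': 'para-title'})
    for br in title_box.find_all('br'):
        br.replace_with(" ")
    return " ".join(title_box.get_text().split())


print(title_with_separator(page))
print(title_replacing_br(page))
```

For this particular snippet both helpers print the same single line. They differ when inline tags sit directly next to text: for `a<br>b<i>c</i>` the separator version yields "a b c", while the <br>-only version yields "a bc".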
