BeautifulSoup - getting rid of paragraph whitespace/line breaks

Question

similarlist = res.find_all_next("div", class_="result-wrapper")
for item in similarlist:
    print(item)

This returns:

<div class="result-wrapper">
<div class="row-fluid result-row">
<div class="span6 result-left">
<p>
<a class="tooltipLink warn-cs" data-original-title="Listen" href="..." rel="tooltip"><i class="..."></i></a>
<a class="muted-link" href="/dictionary/german-english/aa-machen">Aa <b>machen</b></a>
</p>
</div>   
<div class="span6 result-right row-fluid">
<span class="span9">
<a class="muted-link" href="/dictionary/english-german/do-a-poo">to do a poo</a>, <a class="muted-link" href="/dictionary/english-german/pooh">to pooh</a>
</span>
</div>
</div>
</div>

When I choose to print item.get_text() instead, I get

abgeneigt machen
to disincline




abhängig machen
2137

to predicate




Absenker machen
to layer

So basically there are a lot of new lines between the list items that I don't need. Is this because of the <p> tags? How do I get rid of them?

Recommended answer

Yes, between tags the HTML contains whitespace (including newlines) too.
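To see where those newlines come from, here is a minimal sketch that lists every text node in a small fragment. It uses the standard library's html.parser as a stand-in for BeautifulSoup, and the TextCollector class and the fragment string are made up for illustration:

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: records every piece of text found between tags."""

    def __init__(self):
        super().__init__()
        self.text_nodes = []

    def handle_data(self, data):
        # Called for all character data, including whitespace-only runs.
        self.text_nodes.append(data)


# Made-up fragment shaped like the markup in the question.
fragment = "<div>\n<p>\nAa machen\n</p>\n</div>"

collector = TextCollector()
collector.feed(fragment)
collector.close()

# Text nodes that contain nothing but whitespace (here: the line breaks between tags).
whitespace_only = [node for node in collector.text_nodes if not node.strip()]

print(collector.text_nodes)
print(len(whitespace_only))
```

The line breaks that separate the tags in the source are ordinary text as far as the parser is concerned, so get_text() includes them along with the visible words.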

You can easily collapse all multi-line whitespace with a regular expression:

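import re

# Collapse each run of blank or whitespace-only lines into a single blank line.
# re.sub returns a new string, so keep the result.
text = re.sub(r'\n\s*\n', r'\n\n', item.get_text().strip(), flags=re.M)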

This replaces any run of whitespace (newlines, spaces, tabs, etc.) between two newlines with a single blank line.
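As a quick check of what the substitution does, here is a small self-contained sketch. The collapse_blank_lines helper and the sample string are assumptions made for illustration; the sample only imitates what item.get_text() might return.

```python
import re


def collapse_blank_lines(text: str) -> str:
    """Replace each run of blank or whitespace-only lines with one blank line."""
    # Same pattern as in the answer: a newline, optional whitespace, a newline.
    return re.sub(r'\n\s*\n', '\n\n', text.strip())


# Hypothetical stand-in for the string returned by item.get_text().
sample = (
    "\nabgeneigt machen\nto disincline\n\n\n\n\n"
    "abhängig machen\n \t\nto predicate\n"
)

cleaned = collapse_blank_lines(sample)
print(cleaned)
```

Entries end up separated by exactly one blank line. If no blank lines are wanted at all, using '\n' as the replacement instead of '\n\n' should remove them entirely.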