BeautifulSoup sibling structure with br tags

Problem description

I'm trying to parse an HTML document using the BeautifulSoup Python library, but the structure is getting distorted by <br> tags. Let me just give you an example.

Input HTML:

<div>
  some text <br>
  <span> some more text </span> <br>
  <span> and more text </span>
</div>

HTML that BeautifulSoup interprets:

<div>
  some text
  <br>
    <span> some more text </span>
    <br>
      <span> and more text </span>
    </br>
  </br>
</div>

In the source, the spans could be considered siblings. After parsing (using the default parser), the spans are suddenly no longer siblings, as the br tags became part of the structure.

The only solution I can think of is to strip the <br> tags altogether before pouring the HTML into BeautifulSoup, but that doesn't seem very elegant, as it requires me to change the input. What's a better way to solve this?

Solution

Your best bet is to extract() the line breaks. It's easier than you think :).

>>> from bs4 import BeautifulSoup as BS
>>> html = """<div>
...   some text <br>
...   <span> some more text </span> <br>
...   <span> and more text </span>
... </div>"""
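>>> soup = BS(html, "lxml")  # parser named explicitly; the <html>/<body> wrapper in the output below suggests lxml was in use
>>> for linebreak in soup.find_all('br'):
...     linebreak.extract()  # removes the tag from the tree and returns it
... 
<br/>
<br/>
>>> print(soup.prettify())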
<html>
 <body>
  <div>
   some text
   <span>
    some more text
   </span>
   <span>
    and more text
   </span>
  </div>
 </body>
</html>
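
To check that the spans really are siblings afterwards, something like the following sketch may help. It is not part of the original answer: it assumes a reasonably recent bs4 and names the built-in "html.parser" explicitly so the result does not depend on which third-party parsers happen to be installed. It also uses decompose() rather than extract(), since the removed tags are not needed again.

```python
from bs4 import BeautifulSoup

html = """<div>
  some text <br>
  <span> some more text </span> <br>
  <span> and more text </span>
</div>"""

# Assumption: the stdlib-backed parser is used here, not the parser from the
# session above, so no <html>/<body> wrapper is added to the fragment.
soup = BeautifulSoup(html, "html.parser")

# decompose() removes each <br> from the tree and discards it;
# extract() would remove it and hand it back instead.
for linebreak in soup.find_all("br"):
    linebreak.decompose()

first, second = soup.find_all("span")

# Both spans should share the <div> as parent, and the next <span>
# sibling of the first one should be the second one.
print(first.parent.name, second.parent.name)
print(first.find_next_sibling("span") is second)
```

One caveat: if a parser really did nest the spans inside a <br>, as in the tree shown in the question, removing that <br> would remove the nested spans with it. In that situation unwrap(), which replaces a tag with its own children, is probably the safer call.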
