Parsing nested divs with BeautifulSoup
Question
I'm trying to parse a number of web pages that contain text, tables and HTML. Every page has a different number of paragraphs, and while every paragraph begins with an opening <div>, the closing </div> does not occur until the end. I just want to get the content, filtering out certain elements and replacing them with something else.
Desired result: text1 <b>text2</b>(table_deleted)text3
Actual result: text1\n\ntext2some text heretext 3text2some text heretext 3(table deleted)
from bs4 import BeautifulSoup
html = """
<h1>title</h1>
<h3>extra data</h3>
<div>
text1
<div>
<b>next2</b><table>some text here</table>text 3
</div>
</div>"""
soup = BeautifulSoup(html, 'html5lib')
tags = soup.find('h3').find_all_next()
contents = ""
for tag in tags:
    if tag.name == 'table':
        contents += " (table deleted) "
    contents += tag.text.strip()
print(contents)
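A minimal sketch of why the text comes out duplicated, using a simplified version of the markup (the shortened HTML string and the names `names` and `texts` are illustrative, not part of the original code). `find_all_next()` returns every tag after the `<h3>`, nested ones included, and each parent's text already contains the text of its children, so concatenating them repeats the inner content.

```python
from bs4 import BeautifulSoup

# Simplified markup for illustration: one outer div, one nested div.
html = "<h3>extra data</h3><div>text1<div><b>text2</b>text3</div></div>"
soup = BeautifulSoup(html, "html.parser")

# find_all_next() yields every tag after <h3> in document order,
# so the outer div, the inner div and the <b> are all visited.
tags = soup.find("h3").find_all_next()
names = [tag.name for tag in tags]
texts = [tag.get_text() for tag in tags]

print(names)  # ['div', 'div', 'b']
print(texts)  # ['text1text2text3', 'text2text3', 'text2']
```

Joining `texts` would therefore repeat "text2" three times, which matches the kind of duplication seen in the actual result.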
Recommended answer
Don't use html5lib as the parser; use html.parser instead. html5lib rebuilds the tree the way a browser would, which can move the bare text out of the "table" tag in this markup. That being said, you can access the "div" that comes immediately after your "h3" tag using a CSS selector and the select_one method.
From there, you can unwrap the following "div" tag and replace the "table" tag using the replace_with method.
In [107]: from bs4 import BeautifulSoup
In [108]: html = """
...: <h1>title</h1>
...: <h3>extra data</h3>
...: <div>
...: text1
...: <div>
...: <b>next2</b><table>some text here</table>text 3
...: </div>
...: </div>"""
In [109]: soup = BeautifulSoup(html, 'html.parser')
In [110]: my_div = soup.select_one('h3 + div')
In [111]: my_div
Out[111]:
<div>
text1
<div>
<b>next2</b><table>some text here</table>text 3
</div>
</div>
In [112]: my_div.div.unwrap()
Out[112]: <div></div>
In [113]: my_div
Out[113]:
<div>
text1
<b>next2</b><table>some text here</table>text 3
</div>
In [114]: my_div.table.replace_with('(table deleted)')
Out[114]: <table>some text here</table>
In [115]: my_div
Out[115]:
<div>
text1
<b>next2</b>(table deleted)text 3
</div>
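The interactive steps above can be collected into a small script. This is only a sketch under stated assumptions: the helper name `flatten_after_h3` and its `placeholder` parameter are hypothetical, and it assumes the content of interest is always the "div" directly following the "h3". It uses `select_one`, `replace_with` and `unwrap` as in the session, plus `decode_contents()` to get the inner markup without the outer "div".

```python
from bs4 import BeautifulSoup

HTML = """
<h1>title</h1>
<h3>extra data</h3>
<div>
text1
<div>
<b>next2</b><table>some text here</table>text 3
</div>
</div>"""


def flatten_after_h3(html, placeholder="(table deleted)"):
    """Hypothetical helper: return the markup of the div after <h3>,
    with tables replaced by a placeholder and nested divs unwrapped."""
    soup = BeautifulSoup(html, "html.parser")
    outer = soup.select_one("h3 + div")
    if outer is None:
        return ""
    # find_all() returns a list, so modifying the tree while looping is safe.
    for table in outer.find_all("table"):
        table.replace_with(placeholder)
    for inner in outer.find_all("div"):
        inner.unwrap()
    # Inner markup of the outer div, with runs of whitespace collapsed.
    return " ".join(outer.decode_contents().split())


result = flatten_after_h3(HTML)
print(result)  # text1 <b>next2</b>(table deleted)text 3
```

Looping over `find_all` instead of using `my_div.div` and `my_div.table` means pages with several nested divs or tables are handled the same way.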