Parsing nested divs with BeautifulSoup

Problem description

I'm trying to parse a number of web pages that contain text, tables and HTML. Every page has a different number of paragraphs. Each paragraph begins with an opening <div>, but the closing </div> tags do not appear until the very end, so the divs are nested. I just want to get the content, filtering out certain elements and replacing them with something else.

Desired result: text1<b>text2</b>(table_deleted)text3

Actual result: text1\n\ntext2some text heretext 3text2some text heretext 3(table deleted)

from bs4 import BeautifulSoup

html = """
<h1>title</h1>
<h3>extra data</h3>
<div>
    text1
    <div>
        <b>next2</b><table>some text here</table>text 3
    </div>
</div>"""

soup = BeautifulSoup(html, 'html5lib')
tags = soup.find('h3').find_all_next()
contents = ""
for tag in tags:
    if tag.name == 'table':
        contents += " (table deleted) "

    contents += tag.text.strip()

print(contents)

Recommended answer

Don't use html5lib as the parser; use html.parser instead. That said, you can access the "div" that immediately follows your "h3" tag by passing a CSS selector to the select_one method.
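To see why the loop in the question repeats text, it can help to list what find_all_next() actually returns. The following is a minimal sketch using the same HTML as the question; the variable name `following` is only illustrative. find_all_next() yields every later tag in document order, including tags nested inside other matches, so calling .text on the outer div and then again on the inner div, the b and the table counts the same text more than once. The selector 'h3 + div' instead picks only the single div that is the next sibling element of the h3.

```python
from bs4 import BeautifulSoup

html = """
<h1>title</h1>
<h3>extra data</h3>
<div>
    text1
    <div>
        <b>next2</b><table>some text here</table>text 3
    </div>
</div>"""

soup = BeautifulSoup(html, "html.parser")

# find_all_next() returns every tag after the h3, nested tags included.
following = [tag.name for tag in soup.find("h3").find_all_next()]
print(following)  # ['div', 'div', 'b', 'table']

# 'h3 + div' matches only the div that directly follows the h3 as a sibling.
my_div = soup.select_one("h3 + div")
print(my_div.name, len(my_div.find_all("div")))  # div 1
```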

From there, you can unwrap the nested "div" tag with the unwrap method and replace the "table" tag using the replace_with method.

In [107]: from bs4 import BeautifulSoup

In [108]: html = """
     ...: <h1>title</h1>
     ...: <h3>extra data</h3>
     ...: <div>
     ...:     text1
     ...:     <div>
     ...:         <b>next2</b><table>some text here</table>text 3
     ...:     </div>
     ...: </div>"""

In [109]: soup = BeautifulSoup(html, 'html.parser')

In [110]: my_div = soup.select_one('h3 + div')

In [111]: my_div
Out[111]: 
<div>
    text1
    <div>
<b>next2</b><table>some text here</table>text 3
    </div>
</div>

In [112]: my_div.div.unwrap()
Out[112]: <div></div>

In [113]: my_div
Out[113]: 
<div>
    text1

<b>next2</b><table>some text here</table>text 3

</div>

In [114]: my_div.table.replace_with('(table deleted)')
Out[114]: <table>some text here</table>

In [115]: my_div
Out[115]: 
<div>
    text1

<b>next2</b>(table deleted)text 3

</div>
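The session above can be condensed into a standalone script. This is a sketch rather than part of the original answer: looping over all nested divs and all tables, padding the placeholder with spaces, and collapsing whitespace at the end are assumptions made here so the output reads closer to the desired result.

```python
from bs4 import BeautifulSoup

html = """
<h1>title</h1>
<h3>extra data</h3>
<div>
    text1
    <div>
        <b>next2</b><table>some text here</table>text 3
    </div>
</div>"""

soup = BeautifulSoup(html, "html.parser")
my_div = soup.select_one("h3 + div")

# Remove every nested div wrapper but keep its children in place.
for inner in my_div.find_all("div"):
    inner.unwrap()

# Swap each table for a plain-text placeholder.
for table in my_div.find_all("table"):
    table.replace_with(" (table deleted) ")

# decode_contents() keeps inline markup such as <b>; get_text() drops it.
# " ".join(s.split()) collapses the runs of whitespace left by the indentation.
markup = " ".join(my_div.decode_contents().split())
plain = " ".join(my_div.get_text().split())

print(markup)  # text1 <b>next2</b> (table deleted) text 3
print(plain)   # text1 next2 (table deleted) text 3
```

Use `markup` when the <b> tags should survive, as in the desired result from the question, and `plain` when only the text is needed.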
