Iterating through a DOM with BeautifulSoup/Python

Question

I have a DOM like this:

<h2>Main Section</h2>
<p>Bla bla bla</p>
<h3>Subsection</h3>
<p>Some more info</p>

<h3>Subsection 2</h3>
<p>Even more info!</p>


<h2>Main Section 2</h2>
<p>bla</p>
<h3>Subsection</h3>
<p>Some more info</p>

<h3>Subsection 2</h3>
<p>Even more info!</p>

I'd like to generate an iterator that returns 'Main Section', 'Bla bla bla', 'Subsection', etc. Is there a way to do this with BeautifulSoup?

Recommended answer

Here's one way to do it. The idea is to iterate over the main sections (the h2 tags) and, for each h2 tag, iterate over its following siblings until the next h2 tag is reached:

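from bs4 import BeautifulSoup, Tag


# Sample markup from the question; the first paragraph is closed with </p>.
data = """<h2>Main Section</h2>
<p>Bla bla bla</p>
<h3>Subsection</h3>
<p>Some more info</p>

<h3>Subsection 2</h3>
<p>Even more info!</p>


<h2>Main Section 2</h2>
<p>bla</p>
<h3>Subsection</h3>
<p>Some more info</p>

<h3>Subsection 2</h3>
<p>Even more info!</p>"""


# Name the parser explicitly so the result does not depend on what is installed.
soup = BeautifulSoup(data, "html.parser")
for main_section in soup.find_all('h2'):
    # Walk the nodes that follow this h2 at the same level.
    for sibling in main_section.next_siblings:
        # Skip NavigableString nodes (the whitespace between tags).
        if not isinstance(sibling, Tag):
            continue
        # The next h2 starts a new main section.
        if sibling.name == 'h2':
            break
        print(sibling.text)
    print("-------")

Prints:

Bla bla bla
Subsection
Some more info
Subsection 2
Even more info!
-------
bla
Subsection
Some more info
Subsection 2
Even more info!
-------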

Hope that helps.
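
The question asks for an iterator that also returns the h2 text itself ('Main Section'), which the loop above does not print. A minimal sketch of one way to get that is shown here, assuming the markup is well formed and that a flat stream of text in document order is enough. The function name iter_section_text and its tags parameter are hypothetical, introduced only for illustration.

```python
from bs4 import BeautifulSoup


def iter_section_text(html, tags=("h2", "h3", "p")):
    """Yield the stripped text of each matching tag, in document order.

    Hypothetical helper: assumes well-formed markup, so matching tags
    are not nested inside each other.
    """
    soup = BeautifulSoup(html, "html.parser")
    # find_all accepts a list of tag names and returns matches in document order.
    for element in soup.find_all(list(tags)):
        text = element.get_text(strip=True)
        if text:  # skip elements that contain only whitespace
            yield text


sample = """<h2>Main Section</h2>
<p>Bla bla bla</p>
<h3>Subsection</h3>
<p>Some more info</p>
<h2>Main Section 2</h2>
<p>bla</p>"""

for text in iter_section_text(sample):
    print(text)
```

Because this is a generator, items are produced one at a time as the caller asks for them. Unlike the sibling-based loop, it does not mark where one main section ends and the next begins; it only preserves document order.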
