Get text in between two h2 headers using BeautifulSoup

Question

I want to grab the text that comes after the Description header and before the Next header.

I know:

In [8]: soup.findAll('h2')[6]
Out[8]: <h2>Description</h2>

However, I don't know how to grab the actual text. The problem is that I have multiple links to do this on. Some have <p> tags:

<h2>Description</h2>
<p>This is the text I want </p>
<p>This is the text I want</p>
<h2>Next header</h2>

But some do not:

<h2>Description</h2>
This is the text I want

<h2>Next header</h2>

Also, on the pages that do have <p> tags, I can't just do soup.findAll('p')[22], because on some pages the <p> I want is at index 21 or 20.

Answer

Check whether each following sibling is a NavigableString (a text node) or a Tag (an element).

Break out of the loop when the next sibling is another h2 header.

from bs4 import BeautifulSoup, NavigableString, Tag

example = """<h2>Description</h2><p>This is the text I want </p><p>This is the text I want</p><h2>Next header</h2>"""

soup = BeautifulSoup(example, 'html.parser')
for header in soup.find_all('h2'):
    next_node = header
    while True:
        next_node = next_node.next_sibling
        if next_node is None:
            # Reached the end of the document without another header.
            break
        if isinstance(next_node, NavigableString):
            # A bare text node (the case without <p> tags).
            print(next_node.strip())
        if isinstance(next_node, Tag):
            if next_node.name == "h2":
                # Hit the next header, so stop.
                break
            # An element such as <p>: print its text content.
            print(next_node.get_text(strip=True))
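
Run against the example markup, this prints the two "This is the text I want" chunks after the Description header and stops as soon as it reaches <h2>Next header</h2>.

The same traversal can be written a bit more compactly with BeautifulSoup's next_siblings iterator. The following is a minimal sketch of that variant (not part of the original answer), assuming BeautifulSoup 4:

from bs4 import BeautifulSoup, NavigableString, Tag

example = """<h2>Description</h2><p>This is the text I want </p><p>This is the text I want</p><h2>Next header</h2>"""

soup = BeautifulSoup(example, 'html.parser')
for header in soup.find_all('h2'):
    for sibling in header.next_siblings:
        if isinstance(sibling, Tag) and sibling.name == 'h2':
            break  # reached the next header
        # NavigableString is bare text; a Tag needs get_text().
        text = sibling.strip() if isinstance(sibling, NavigableString) else sibling.get_text(strip=True)
        if text:
            print(text)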
