How to get all text between just two specified tags using BeautifulSoup?

Question

html = """
...
<tt class="descname">all</tt>
<big>(</big>
<em>iterable</em>
<big>)</big>
<a class="headerlink" href="#all" title="Permalink to this definition">¶</a>
...
"""

I want to get all the text from the first opening big tag up to, but not including, the first occurrence of an a tag. For this example, that means I should get (iterable) as a single string.

Recommended answer

I would avoid nextSibling because, judging from your question, you want to include everything up to the next <a>, regardless of whether it sits in a sibling, parent, or child element.
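To illustrate the point about siblings, here is a small sketch. The `nested` markup is a hypothetical variation of the question's HTML (not taken from the question) in which the <a> is wrapped inside a <span>. Sibling traversal stays on one level of the tree, while `find_next` follows document order and descends into child elements.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the <a> is a child of <span>, not a sibling of <big>.
nested = '<p><big>(</big><span><em>iterable</em><a href="#all">¶</a></span></p>'
soup = BeautifulSoup(nested, 'html.parser')
big = soup.find('big')

# Siblings of <big> live on one level only, so the nested <a> is never visited.
sibling_names = [node.name for node in big.next_siblings]

# find_next walks the document in order and therefore reaches the nested <a>.
found = big.find_next('a')

print(sibling_names)
print(found)
```

With markup like this, a loop over siblings would have to descend into each sibling by hand, which is why a document-order walk is the simpler choice here.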

Therefore, I think the best approach is to find the node that is the next <a> element and loop recursively until you reach it, appending each string you encounter along the way. You may need to tidy up the code below if your HTML differs a lot from the sample, but something like this should work:

from bs4 import BeautifulSoup, NavigableString

# `html` is the variable from the question.
soup = BeautifulSoup(html, 'html.parser')
first_big_tag = soup.find('big')
next_a_tag = first_big_tag.find_next('a')

def loop_until_a(text, element):
    # Stop at the target <a> tag (or at the end of the document).
    if element is None or element is next_a_tag:
        return text
    # Collect string nodes only; tags are simply stepped over.
    # strip() drops the newlines that sit between the tags.
    if isinstance(element, NavigableString):
        text += element.strip()
    return loop_until_a(text, element.next_element)

target_string = loop_until_a('', first_big_tag)
print(target_string)
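
The same walk can also be written without recursion, which avoids hitting Python's recursion limit on long documents. The following is a minimal sketch rather than part of the original answer: `text_between` and `sample` are hypothetical names, and `sample` simply repeats the markup from the question so the block runs on its own. It relies on `next_elements`, which yields every following node in document order.

```python
from bs4 import BeautifulSoup, NavigableString

def text_between(start, stop):
    """Join the string nodes that follow `start` in document order, up to `stop`."""
    parts = []
    for element in start.next_elements:
        if element is stop:
            break
        # Keep text nodes only; strip() removes whitespace between tags.
        if isinstance(element, NavigableString):
            parts.append(element.strip())
    return ''.join(parts)

# Same markup as in the question, repeated here to keep the sketch self-contained.
sample = """
<tt class="descname">all</tt>
<big>(</big>
<em>iterable</em>
<big>)</big>
<a class="headerlink" href="#all" title="Permalink to this definition">¶</a>
"""
soup = BeautifulSoup(sample, 'html.parser')
first_big = soup.find('big')
result = text_between(first_big, first_big.find_next('a'))
print(result)
```

Note that strip() also removes meaningful spaces at the edges of each text node, so if your content contains such spaces you may prefer to join the raw strings and normalise whitespace afterwards.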
