How to get all text between just two specified tags using BeautifulSoup?

Question

html = """
...
<tt class="descname">all</tt>
<big>(</big>
<em>iterable</em>
<big>)</big>
<a class="headerlink" href="#all" title="Permalink to this definition">¶</a>
...
"""

I want to get all the text from the first <big> tag up to, but not including, the first <a> tag that follows it. For this example, that means I should get (iterable) as a string.

Recommended answer

I would avoid nextSibling, as from your question, you want to include everything up until the next <a>, regardless of whether that is in a sibling, parent or child element.
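
To illustrate why a sibling-only search can fall short, here is a small sketch. The markup in `sample` is hypothetical (it is not from the question): it wraps the closing parenthesis and the link in a <span>, so the <a> is no longer a sibling of the first <big>. It assumes `find_next_sibling` and `find_next` behave as documented in Beautiful Soup 4.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the <a> tag is nested inside a <span>,
# so it is not a sibling of the first <big> tag.
sample = (
    '<p><big>(</big><em>iterable</em>'
    '<span><big>)</big><a href="#all">¶</a></span></p>'
)
soup = BeautifulSoup(sample, 'html.parser')
first_big = soup.find('big')

# Sibling-only search: looks only at elements with the same parent.
sibling_a = first_big.find_next_sibling('a')

# Document-order search: looks at everything that follows, at any depth.
document_a = first_big.find_next('a')

print(sibling_a)
print(document_a['href'])
```

In this sketch the sibling search comes back empty while the document-order search still reaches the link, which is the case the answer is guarding against.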

Therefore I think the best approach is to find the node that is the next <a> element and loop recursively until then, adding each string as encountered. You may need to tidy up the below if your HTML is vastly different from the sample, but something like this should work:

from bs4 import BeautifulSoup, NavigableString

# Parse the `html` variable from the question.
soup = BeautifulSoup(html, 'html.parser')
firstBigTag = soup.find('big')
nextATag = firstBigTag.find_next('a')

def loopUntilA(text, element):
    # Stop at the target <a> tag, or at the end of the document.
    if element is None or element is nextATag:
        return text
    # Only string nodes carry text; tags are simply stepped through.
    # strip() drops the whitespace-only strings between tags.
    if isinstance(element, NavigableString):
        text += element.strip()
    return loopUntilA(text, element.next_element)

targetString = loopUntilA('', firstBigTag)
print(targetString)
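
The recursive walk above can also be written as a plain loop, which may be preferable on long documents because Python limits recursion depth (1000 frames by default). The following is a minimal sketch of that variant, not part of the original answer: `text_between` and `sample_html` are names made up for illustration, and the sample reuses the question's markup without the elided `...` lines.

```python
from bs4 import BeautifulSoup, NavigableString

def text_between(start_tag, stop_tag):
    """Join every string that follows start_tag's opening, up to stop_tag."""
    parts = []
    # next_elements yields tags and strings in document order.
    for element in start_tag.next_elements:
        if element is stop_tag:
            break
        if isinstance(element, NavigableString):
            # strip() discards the whitespace-only strings between tags.
            parts.append(element.strip())
    return ''.join(parts)

sample_html = """
<tt class="descname">all</tt>
<big>(</big>
<em>iterable</em>
<big>)</big>
<a class="headerlink" href="#all" title="Permalink to this definition">¶</a>
"""
soup = BeautifulSoup(sample_html, 'html.parser')
start = soup.find('big')
stop = start.find_next('a')
result = text_between(start, stop)
print(result)
```

Note that stripping each string also removes meaningful spaces at the edges of text nodes, so this sketch suits compact output such as (iterable) better than running prose.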
