BeautifulSoup removing nested tags

Question

I am trying to build a generic scraper with BeautifulSoup, and for that I need to detect the tag that has text directly under it.

Consider the following example:

<body>
<div class="c1">
    <div class="c2">
        <div class="c3">
            <div class="c4">
                <div class="c5">
                    <h1> A heading for section </h1>
                </div>
                <div class="c5">
                    <p> Some para </p>
                </div>
                <div class="c5">
                    <h2> Sub heading </h2>
                    <p> <span> Blah Blah </span> </p>
                </div>
            </div>
        </div>
    </div>
</div>
</body>

Here my objective is to extract the div with class c4, since it holds all the textual content. The divs before it (c1 to c3) are just wrappers as far as I am concerned.

One possible way of identifying the node that I came up with is:

if node.find(re.compile("^h[1-6]"), recursive=False) is not None:
    return node.parent.parent

But this is too specific to this particular case.

Is there an efficient way to find text within a single level of recursion? That is, if I do something like

node.find(text=True, recursion_level=1)

then it should return text considering only the immediate children.

This is my solution so far; I am not sure whether it holds in all cases.

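def check_for_text(node):
    # First string that is an immediate child of node, or None.
    # (The string= argument was called text= before Beautiful Soup 4.4.0.)
    return node.find(string=True, recursive=False)

def check_1_level_depth(node):
    # Prefer text directly under node; otherwise look one level down.
    text = check_for_text(node)
    if text:
        return text

    return [check_for_text(child) for child in node.children]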

In the code above, node is the element of the soup currently being checked (a div, span, etc.). Please assume that I am handling all exceptions in check_for_text() (AttributeError: 'NavigableString').
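As a rough illustration of "text among the immediate children only", the sketch below collects the strings that sit directly under a node and skips whitespace-only ones. The helper name direct_text and the small HTML fragment are made up for this example and are not part of the question's code; it also assumes the built-in html.parser.

```python
from bs4 import BeautifulSoup, NavigableString


def direct_text(node):
    """Return the non-whitespace strings that are immediate children of node.

    Hypothetical helper for illustration only.
    """
    return [
        str(child).strip()
        for child in node.children
        if isinstance(child, NavigableString) and child.strip()
    ]


# Illustrative fragment, loosely based on the example above
html = '<div class="c5"><h2> Sub heading </h2><p> lead <span> Blah Blah </span> </p></div>'
soup = BeautifulSoup(html, "html.parser")

print(direct_text(soup.div))   # [] because the div only contains tags
print(direct_text(soup.p))     # ['lead']
print(direct_text(soup.span))  # ['Blah Blah']
```

Iterating over node.children avoids calling find() on a NavigableString, which is where the AttributeError mentioned above tends to come from.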

Recommended answer

It turns out I had to write a recursive function that eliminates tags with a single child. Here is the code:

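import bs4

# Pass soup.body as the initial node
def process_node(node):
    # A string is a leaf: return its text
    if isinstance(node, bs4.element.NavigableString):
        return str(node)
    if len(node.contents) == 1:
        # Single child: this tag is only a wrapper, so descend through it
        return process_node(node.contents[0])
    if len(node.contents) > 1:
        # Several children: process each of them in turn
        return [process_node(child) for child in node.children]
    # Empty tag: nothing to return
    return None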

So far it works well and runs fast.
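One thing to watch for: when the HTML is pretty-printed, as in the example from the question, the whitespace between tags is parsed as text nodes, so len(node.contents) can be greater than 1 even for a pure wrapper. The sketch below is a variant that ignores whitespace-only strings and returns the tag itself rather than its text. The names significant_children and unwrap are hypothetical, and html.parser is assumed.

```python
from bs4 import BeautifulSoup, NavigableString, Tag

# The example from the question, slightly condensed
HTML = """
<body>
<div class="c1">
    <div class="c2">
        <div class="c3">
            <div class="c4">
                <div class="c5"><h1> A heading for section </h1></div>
                <div class="c5"><p> Some para </p></div>
                <div class="c5"><h2> Sub heading </h2><p> <span> Blah Blah </span> </p></div>
            </div>
        </div>
    </div>
</div>
</body>
"""


def significant_children(node):
    """Children of node, ignoring whitespace-only strings (hypothetical helper)."""
    return [
        child
        for child in node.children
        if not (isinstance(child, NavigableString) and not child.strip())
    ]


def unwrap(node):
    """Descend through tags whose only significant child is another tag."""
    children = significant_children(node)
    while len(children) == 1 and isinstance(children[0], Tag):
        node = children[0]
        children = significant_children(node)
    return node


soup = BeautifulSoup(HTML, "html.parser")
content = unwrap(soup.body)
print(content.name, content.get("class"))  # div ['c4']
```

Starting from soup.body, the loop skips c1 to c3 because each has exactly one tag child, and stops at c4 because it has three.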
