BeautifulSoup counting tags without parsing deep inside them

Problem description

I thought about the following while writing an answer to this question (http://stackoverflow.com/questions/27673349/python-xml-parsing-algorithm-speed/27673558#27673558).

Suppose I have a deeply nested xml file like this (but much more nested and much longer):

<section name="1">
    <subsection name="foo">
        <subsubsection name="bar">
            <deeper name="hey">
                <much_deeper name="yo">
                    <li>Some content</li>
                </much_deeper>
            </deeper>
        </subsubsection>
    </subsection>
</section>
<section name="2">
    ... and so forth
</section>

The problem with len(soup.find_all("section")) is that while doing find_all("section"), BS keeps searching deep into a tag that I know won't contain any other section tag.

So, two questions:

  1. Is there a way to make BS not search recursively into an already found tag?
  2. If the answer to 1 is yes, will it be more efficient or is it the same internal process?

Solution

BeautifulSoup cannot give you just a count of the tags it found: find_all() always builds the full list of matches, so you have to take its len().

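What you can improve, though, is to stop BeautifulSoup from searching for sections inside other sections by passing recursive=False. Note that recursive=False only looks at the direct children of the tag it is called on, so call it on the element that directly contains the section tags: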

len(soup.find_all("section", recursive=False))
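As a minimal sketch of the difference, assuming the sections sit under a wrapper element (the `<root>` tag and the sample document below are made up for illustration, and `html.parser` is just one parser choice):

```python
from bs4 import BeautifulSoup

# Hypothetical sample document; <root> is an assumed wrapper element.
DOC = """<root>
<section name="1">
    <subsection name="foo">
        <li>Some content</li>
    </subsection>
</section>
<section name="2">
    <subsection name="bar">
        <li>More content</li>
    </subsection>
</section>
</root>"""

soup = BeautifulSoup(DOC, "html.parser")
root = soup.find("root")

# Default search: walks every descendant of the document.
deep_count = len(soup.find_all("section"))

# Shallow search: only inspects the direct children of <root>.
shallow_count = len(root.find_all("section", recursive=False))

# Pitfall: the only tag directly under `soup` is <root>, so a shallow
# search from `soup` itself finds no <section> at all.
misplaced_count = len(soup.find_all("section", recursive=False))

print(deep_count, shallow_count, misplaced_count)
```

Both searches agree on the count here because no section is nested inside another one; the shallow search simply never descends into the subsections.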

Aside from that improvement, lxml would do the job faster:

tree.xpath('count(//section)')
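For completeness, here is a small self-contained sketch of the lxml variant, again assuming a made-up sample document with a `<root>` wrapper. Note that XPath's count() comes back as a float, and that `//section` still visits every descendant, whereas a path such as `/root/section` only looks at the children of the root:

```python
from lxml import etree

# Hypothetical sample document; <root> is an assumed wrapper element.
DOC = """<root>
<section name="1">
    <subsection name="foo">
        <li>Some content</li>
    </subsection>
</section>
<section name="2">
    <subsection name="bar">
        <li>More content</li>
    </subsection>
</section>
</root>"""

tree = etree.fromstring(DOC)

# Counts <section> elements at any depth; the result is a float.
all_sections = tree.xpath("count(//section)")

# Counts only the <section> elements that are direct children of <root>.
top_sections = tree.xpath("count(/root/section)")

print(int(all_sections), int(top_sections))
```

Which of the two expressions is quicker on a given file is worth measuring rather than assuming.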
