BeautifulSoup counting tags without parsing deep inside them
Problem description
I thought about the following while writing an answer to this question: http://stackoverflow.com/questions/27673349/python-xml-parsing-algorithm-speed/27673558#27673558
Suppose I have a deeply nested XML file like this (but much more nested and much longer):
<section name="1">
    <subsection name="foo">
        <subsubsection name="bar">
            <deeper name="hey">
                <much_deeper name="yo">
                    <li>Some content</li>
                </much_deeper>
            </deeper>
        </subsubsection>
    </subsection>
</section>
<section name="2">
    ... and so forth
</section>
The problem with len(soup.find_all("section")) is that while doing find_all("section"), BS keeps searching deep into a tag that I know won't contain any other section tag.
So, two questions:
- Is there a way to make BS not search recursively into an already found tag?
- If the answer to 1 is yes, will it be more efficient or is it the same internal process?
Answer
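BeautifulSoup cannot give you just a count of the tags it found; find_all() always builds the list of matching tags.
What you can improve, though, is this: don't let BeautifulSoup go searching for sections inside other sections, by passing recursive=False. With that argument, find_all() only examines the direct children of the object it is called on: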
len(soup.find_all("section", recursive=False))
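To see what recursive=False changes, here is a minimal sketch. The sample markup (with a nested section added on purpose) and the variable names are assumptions made for illustration and are not part of the original question. html.parser is used because it does not wrap the fragment in html/body tags, which would leave no section among the direct children of soup.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: two top-level sections, one of which contains a
# nested section so that the two calls below give different counts.
xml = """
<section name="1">
    <subsection name="foo">
        <section name="nested">
            <li>Some content</li>
        </section>
    </subsection>
</section>
<section name="2">
    <li>More content</li>
</section>
"""

# html.parser keeps the sections as direct children of the soup object.
soup = BeautifulSoup(xml, "html.parser")

# Default search: walks the whole tree, so the nested section is found too.
all_sections = len(soup.find_all("section"))

# recursive=False: only the direct children of soup are examined.
top_level_sections = len(soup.find_all("section", recursive=False))

print(all_sections)        # 3
print(top_level_sections)  # 2
```

This only helps when the sections really are direct children of the object you call find_all() on. If they sit under a wrapper element, call find_all() on that wrapper instead.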
Aside from that improvement, lxml would do the job faster:
tree.xpath('count(//section)')
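A self-contained sketch of the lxml variant follows. The root wrapper element and the sample data are assumptions: an XML parser needs a single root element, and the question does not show one. Note that XPath's count() returns a float.

```python
from lxml import etree

# Hypothetical sample wrapped in an assumed <root> element, since
# well-formed XML must have exactly one root.
xml = """<root>
<section name="1">
    <subsection name="foo">
        <section name="nested">
            <li>Some content</li>
        </section>
    </subsection>
</section>
<section name="2">
    <li>More content</li>
</section>
</root>"""

tree = etree.fromstring(xml)

# Counts every section element anywhere in the document.
total = tree.xpath("count(//section)")

# Counts only the sections that are direct children of the assumed root,
# which mirrors recursive=False in BeautifulSoup.
top_level = tree.xpath("count(/root/section)")

print(total)      # 3.0
print(top_level)  # 2.0
```

Wrap the result in int() if you need an integer count.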