BeautifulSoup `find_all` generator
Question
Is it possible to turn find_all into a more memory-efficient generator? For example:
Given:
soup = BeautifulSoup(content, "html.parser")
return soup.find_all('item')
I would like to instead use:
soup = BeautifulSoup(content, "html.parser")
while True:
    yield soup.next_item_generator()
(assume proper handling of the final StopIteration exception)
There are some generators built in, but they do not yield the next result of a find: find returns just the first item, and find_all builds the full result list at once. With thousands of items, that uses a lot of memory; for 5,792 items I'm seeing a spike of just over 1 GB of RAM.
I am well aware that there are more efficient parsers, such as lxml, that can accomplish this task. Let's assume there are other business constraints preventing me from using anything else.
How can I turn find_all into a generator so I can iterate through the results in a more memory-efficient way?
Answer
There is no "find" generator in BeautifulSoup, as far as I know, but we can combine a SoupStrainer with the .children generator.
Let's imagine we have this sample HTML:
<div>
<item>Item 1</item>
<item>Item 2</item>
<item>Item 3</item>
<item>Item 4</item>
<item>Item 5</item>
</div>
from which we need to get the text of all item nodes.
We can use a SoupStrainer to parse only the item tags, then iterate over the .children generator and get the texts:
from bs4 import BeautifulSoup, SoupStrainer
data = """
<div>
<item>Item 1</item>
<item>Item 2</item>
<item>Item 3</item>
<item>Item 4</item>
<item>Item 5</item>
</div>"""
parse_only = SoupStrainer('item')
soup = BeautifulSoup(data, "html.parser", parse_only=parse_only)
for item in soup.children:
    print(item.get_text())
Prints:
Item 1
Item 2
Item 3
Item 4
Item 5
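The steps above can be wrapped into a reusable generator function. This is a minimal sketch, and the name stream_items is my own, not part of BeautifulSoup:

```python
from bs4 import BeautifulSoup, SoupStrainer

def stream_items(content, tag="item"):
    # Parse only the desired tags so the tree stays small,
    # then lazily yield each node's text via the .children generator.
    strainer = SoupStrainer(tag)
    soup = BeautifulSoup(content, "html.parser", parse_only=strainer)
    for node in soup.children:
        yield node.get_text()

data = "<div><item>Item 1</item><item>Item 2</item></div>"
print(list(stream_items(data)))  # ['Item 1', 'Item 2']
```

Note that the SoupStrainer only reduces what ends up in the parsed tree; BeautifulSoup still reads the whole input string, so the savings come from the smaller tree, not from streaming the input.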
In other words, the idea is to cut the tree down to the desired tags and use one of the available generators, like .children. You can also use one of these generators directly and manually filter the tags by name or other criteria inside the generator body, e.g. something like:

def generate_items(soup):
    for tag in soup.descendants:
        if tag.name == "item":
            yield tag.get_text()

.descendants generates the child elements recursively, while .children only considers the direct children of a node.
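To illustrate that last distinction, here is a small sketch (the sample markup is my own):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><p><b>deep</b></p></div>", "html.parser")

# .children only walks the top level of the parsed tree.
top = [tag.name for tag in soup.children]

# .descendants walks the whole tree recursively; filtering out
# NavigableString nodes (whose .name is None) leaves only the tags.
deep = [tag.name for tag in soup.descendants if tag.name is not None]

print(top)   # ['div']
print(deep)  # ['div', 'p', 'b']
```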