BeautifulSoup `find_all` generator


Problem Description


Is there any way to turn find_all into a more memory-efficient generator? For example:

Given:

soup = BeautifulSoup(content, "html.parser")
return soup.find_all('item')

I would like to instead use:

soup = BeautifulSoup(content, "html.parser")
while True:
    yield soup.next_item_generator()

(assume proper handling of the final StopIteration exception)

There are some generators built in, but none of them yield the next result of a find. find returns just the first item. With thousands of items, find_all sucks up a lot of memory: for 5792 items, I'm seeing a spike of just over 1 GB of RAM.

I am well aware that there are more efficient parsers, such as lxml, that can accomplish this. Let's assume that there are other business constraints preventing me from using anything else.

How can I turn find_all into a generator to iterate through the results in a more memory-efficient way?

Solution

There is no "find" generator in BeautifulSoup, as far as I know, but we can combine the use of a SoupStrainer with the .children generator.

Let's imagine we have this sample HTML:

<div>
    <item>Item 1</item>
    <item>Item 2</item>
    <item>Item 3</item>
    <item>Item 4</item>
    <item>Item 5</item>
</div>

from which we need to get the text of all item nodes.

We can use the SoupStrainer to parse only the item tags, then iterate over the .children generator and get the text of each:

from bs4 import BeautifulSoup, SoupStrainer

data = """
<div>
    <item>Item 1</item>
    <item>Item 2</item>
    <item>Item 3</item>
    <item>Item 4</item>
    <item>Item 5</item>
</div>"""

parse_only = SoupStrainer('item')
soup = BeautifulSoup(data, "html.parser", parse_only=parse_only)
for item in soup.children:
    print(item.get_text())

Prints:

Item 1
Item 2
Item 3
Item 4
Item 5

In other words, the idea is to cut the tree down to the desired tags and use one of the available generators, like .children. You can also use one of these generators directly and manually filter the tag by name or other criteria inside the generator body, e.g. something like:

def generate_items(soup):
    # .descendants is itself lazy, so matching items are yielded one at a time
    for tag in soup.descendants:
        if tag.name == "item":  # text nodes (NavigableStrings) have no matching name
            yield tag.get_text()

The .descendants generator produces the child elements recursively, while .children only considers the direct children of a node.
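Putting the two ideas together, here is a self-contained sketch (the helper name iter_item_texts is my own, not part of BeautifulSoup) that prunes the tree with a SoupStrainer and then streams the item texts lazily:

```python
from bs4 import BeautifulSoup, SoupStrainer


def iter_item_texts(markup):
    # Keep only <item> tags in the parse tree, then yield their texts one at a time.
    only_items = SoupStrainer('item')
    soup = BeautifulSoup(markup, "html.parser", parse_only=only_items)
    for tag in soup.children:
        if tag.name == "item":  # skip any stray whitespace text nodes
            yield tag.get_text()


data = "<div><item>Item 1</item><item>Item 2</item></div>"
for text in iter_item_texts(data):
    print(text)
```

Note that the parser still has to read the full document either way; the SoupStrainer only limits which nodes are kept in the resulting tree, which is where the memory savings come from.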
