How to get "subsoups" and concatenate/join them?


Question

I have an HTML document I need to process, and I'm using BeautifulSoup for that. I would like to retrieve a few "subsoups" from the document and join them into one soup, so that I can later pass it to a function that expects a soup object.

In case that's not clear, here is an example:

from bs4 import BeautifulSoup

my_document = """
<html>
<body>

<h1>Some Heading</h1>

<div id="first">
<p>A paragraph.</p>
<a href="another_doc.html">A link</a>
<p>A paragraph.</p>
</div>

<div id="second">
<p>A paragraph.</p>
<p>A paragraph.</p>
</div>

<div id="third">
<p>A paragraph.</p>
<a href="another_doc.html">A link</a>
<a href="yet_another_doc.html">A link</a>
</div>

<p id="loner">A paragraph.</p>

</body>
</html>
"""

# name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(my_document, "html.parser")

# find the needed parts
first = soup.find("div", {"id": "first"})
third = soup.find("div", {"id": "third"})
loner = soup.find("p", {"id": "loner"})
subsoups = [first, third, loner]

# create a new (sub)soup
resulting_soup = do_some_magic(subsoups)

# use it in a function that expects a soup object and calls its methods
function_expecting_a_soup(resulting_soup)

The goal is for resulting_soup to be an object that is, or behaves like, a soup with the following content:

<div id="first">
<p>A paragraph.</p>
<a href="another_doc.html">A link</a>
<p>A paragraph.</p>
</div>

<div id="third">
<p>A paragraph.</p>
<a href="another_doc.html">A link</a>
<a href="yet_another_doc.html">A link</a>
</div>

<p id="loner">A paragraph.</p>

Is there a convenient way to do that? If there is a better way to retrieve the "subsoups" than find(), I can use that instead. Thanks.

Update

Wondercricket suggested a solution that concatenates the strings of the found tags and parses the result again into a new BeautifulSoup object. That is one possible way to solve the problem, but the re-parsing may take longer than I'd like, especially when I want to retrieve most of a document and have many such documents to process. find() returns a bs4.element.Tag. Is there no way to concatenate several Tags into one soup without converting them to a string and parsing that string?
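For reference, the string-based approach described in this update could look roughly like the sketch below. The shortened markup and the variable names are illustrative assumptions, not the code from the original suggestion.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the document in the question (assumed for illustration).
html = (
    '<div id="first"><p>A paragraph.</p></div>'
    '<div id="second"></div>'
    '<p id="loner">A paragraph.</p>'
)
soup = BeautifulSoup(html, "html.parser")
subsoups = [soup.find(id="first"), soup.find(id="loner")]

# Serialize each Tag back to markup, concatenate, and parse the result again.
resulting_soup = BeautifulSoup("".join(str(tag) for tag in subsoups), "html.parser")
print(resulting_soup)
```

The second BeautifulSoup() call is the re-parsing cost that the update is concerned about.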

Recommended answer

SoupStrainer does what you are asking for and, as a bonus, gives you a performance boost, because only the elements you ask for are parsed into the tree rather than the complete document:

from bs4 import BeautifulSoup, SoupStrainer

parse_only = SoupStrainer(id=["first", "third", "loner"])
soup = BeautifulSoup(my_document, "html.parser", parse_only=parse_only)

Now the soup object contains only the desired elements:

<div id="first">
 <p>
  A paragraph.
 </p>
 <a href="another_doc.html">
  A link
 </a>
 <p>
  A paragraph.
 </p>
</div>
<div id="third">
 <p>
  A paragraph.
 </p>
 <a href="another_doc.html">
  A link
 </a>
 <a href="yet_another_doc.html">
  A link
 </a>
</div>
<p id="loner">
 A paragraph.
</p>
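If the tags have already been located with find(), as in the question, another option that avoids re-parsing is to append copies of them to an empty soup. The following is a minimal sketch, assuming a Beautiful Soup version in which copy.copy() on a Tag returns a detached copy (documented since 4.4.0); the shortened markup and the variable names are illustrative.

```python
import copy

from bs4 import BeautifulSoup

# Shortened stand-in for the document in the question (assumed for illustration).
html = """<body>
<div id="first"><p>A paragraph.</p></div>
<div id="second"><p>Skipped.</p></div>
<p id="loner">A paragraph.</p>
</body>"""

original = BeautifulSoup(html, "html.parser")
wanted = [original.find(id="first"), original.find(id="loner")]

# Start from an empty soup and append detached copies of the tags.
# Appending the tags themselves would move them out of the original tree.
joined = BeautifulSoup("", "html.parser")
for tag in wanted:
    joined.append(copy.copy(tag))

print(joined)
```

The copies cost some time and memory of their own, so whether this beats re-parsing for large documents would need measuring.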


Is it also possible to specify tags as well as ids? For example, what if I want to keep all paragraphs with class="someclass" but not divs with the same class?

In this case, you can write a search function that combines multiple criteria and pass it to the SoupStrainer:

from bs4 import BeautifulSoup, SoupStrainer

my_document = """
<html>
<body>

    <h1>Some Heading</h1>

    <div id="first">
    <p>A paragraph.</p>
    <a href="another_doc.html">A link</a>
    <p>A paragraph.</p>
    </div>

    <div id="second">
    <p>A paragraph.</p>
    <p>A paragraph.</p>
    </div>

    <div id="third">
    <p>A paragraph.</p>
    <a href="another_doc.html">A link</a>
    <a href="yet_another_doc.html">A link</a>
    </div>

    <p id="loner">A paragraph.</p>

    <p class="myclass">test</p>
</body>
</html>
"""

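def search(tag, attrs):
    # `tag` is the tag name; `attrs` is its attribute dict.
    # "class" may arrive as a plain string, so normalize it to a list.
    classes = attrs.get("class") or []
    if isinstance(classes, str):
        classes = classes.split()
    if tag == "p" and "myclass" in classes:
        return tag

    if attrs.get("id") in ["first", "third", "loner"]:
        return tag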


parse_only = SoupStrainer(search)

soup = BeautifulSoup(my_document, "html.parser", parse_only=parse_only)

print(soup.prettify())
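The same criteria can also be applied after parsing, by passing a function to find_all(). This is only a sketch of an alternative, not part of the answer above: it parses the whole document, so it gives up the speed benefit of SoupStrainer. The names wanted and WANTED_IDS and the shortened markup are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Shortened stand-in document (assumed for illustration).
html = """
<div id="first"><p>A paragraph.</p></div>
<div class="myclass">A div with the class.</div>
<p class="myclass">test</p>
<p id="loner">A paragraph.</p>
"""

WANTED_IDS = {"first", "third", "loner"}  # ids taken from the example above


def wanted(tag):
    # Here `tag` is a bs4 Tag, and its "class" attribute is a list of class names.
    if tag.name == "p" and "myclass" in tag.get("class", []):
        return True
    return tag.get("id") in WANTED_IDS


soup = BeautifulSoup(html, "html.parser")
matches = soup.find_all(wanted)
print([(t.name, t.get("id"), t.get("class")) for t in matches])
```

Note that find_all() searches recursively, so an element that matches while nested inside another match is returned as a separate result.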
