How to do partial conditioning on a tag for find_all() in bs4?


Problem description

I have an XML document that contains multiple tags that look like this:

<textblock height="55" hpos="143" id="Page1_Block5" lang="en-US" stylerefs="StyleId-E6BF91A3-3D6A-442F-9A46-22A0459A02E9- font1" vpos="226" width="393">

I want to get all the <textblock> tags grouped by page, using the id attribute of the textblock tag. However, my ids are written like this: id="Page1_Block5".

I want to filter only on the page number, not the block number (that is, I want all blocks of a specific page).

I am trying to do this with:

xml_soup = bs.BeautifulSoup(table, 'lxml')

text_blocks = xml_soup.find_all('textblock')

What additional parameters do I need to pass to find_all() so that the results are filtered only on Page{}?

Recommended answer

This should help:

text_blocks = xml_soup.find_all('textblock', id = lambda value: value and value.startswith("Page1"))

Here is my full code:

from bs4 import BeautifulSoup

xml = """
<textblock height="55" hpos="143" id="Page1_Block5" lang="en-US" stylerefs="StyleId-E6BF91A3-3D6A-442F-9A46-22A0459A02E9- font1" vpos="226" width="393">
"""

xml_soup = BeautifulSoup(xml,'lxml')

text_blocks = xml_soup.find_all('textblock', id = lambda value: value and value.startswith("Page1"))

Explanation:

The lambda function checks whether the id starts with Page1 and, if it does, the tag is included in the result. The "value and" guard skips tags that have no id attribute, for which value is None. I also added a few more tags to the xml variable. Here is the test data that I used:

xml = """
<textblock height="55" hpos="143" id="Page1_Block5" lang="en-US" stylerefs="StyleId-E6BF91A3-3D6A-442F-9A46-22A0459A02E9- font1" vpos="226" width="393">
<textblock height="55" hpos="143" id="Page1_Block4" lang="en-US" stylerefs="StyleId-E6BF91A3-3D6A-442F-9A46-22A0459A02E9- font1" vpos="226" width="393">
<textblock height="55" hpos="143" id="Page2_Block5" lang="en-US" stylerefs="StyleId-E6BF91A3-3D6A-442F-9A46-22A0459A02E9- font1" vpos="226" width="393">
<textblock height="55" hpos="143" id="Page1_Block1" lang="en-US" stylerefs="StyleId-E6BF91A3-3D6A-442F-9A46-22A0459A02E9- font1" vpos="226" width="393">
"""

As you can see, there are 3 textblock tags whose id starts with Page1. When I ran my code on this test data and printed the length of text_blocks, this is the output I got:

>>> len(text_blocks)
3

This shows that the code works. Hope this helps!
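One caveat worth noting: a bare startswith("Page1") test would also match ids such as Page10_Block1. The sketch below is one possible way to avoid that, by anchoring on the underscore after the page number and by grouping all blocks per page. The helper names blocks_for_page and group_by_page, the sample ids, and the use of the built-in html.parser (instead of lxml) are assumptions made for illustration, not part of the original answer.

```python
import re
from collections import defaultdict

from bs4 import BeautifulSoup

# Hypothetical sample data; only the id attribute matters here.
xml = """
<textblock id="Page1_Block1"></textblock>
<textblock id="Page1_Block2"></textblock>
<textblock id="Page2_Block1"></textblock>
<textblock id="Page10_Block1"></textblock>
"""

# html.parser is used so the sketch has no dependency on lxml (assumption).
soup = BeautifulSoup(xml, "html.parser")

# Captures the page number from ids shaped like "Page<number>_...".
ID_PATTERN = re.compile(r"^Page(\d+)_")


def blocks_for_page(soup, page):
    """Return textblock tags whose id starts with 'Page<page>_'."""
    # Including the underscore keeps Page1 from also matching Page10.
    return soup.find_all("textblock", id=re.compile(rf"^Page{page}_"))


def group_by_page(soup):
    """Return a dict mapping page number to the list of its textblock tags."""
    groups = defaultdict(list)
    for tag in soup.find_all("textblock", id=ID_PATTERN):
        page = int(ID_PATTERN.match(tag["id"]).group(1))
        groups[page].append(tag)
    return dict(groups)


print([tag["id"] for tag in blocks_for_page(soup, 1)])
print(sorted(group_by_page(soup)))
```

find_all() accepts a compiled regular expression as an attribute filter, so no lambda is needed here; tags without an id are simply not matched.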

P.S. You can refer to this link for more details about extracting elements with an id that starts with a particular string.
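As an alternative way to match ids that start with a particular string, Beautiful Soup's select() method accepts CSS attribute selectors, where [id^="..."] means "id starts with". The following is a minimal sketch under the assumption that a reasonably recent bs4 (with its CSS selector support) is installed; the sample ids and the use of html.parser are illustrative choices, not taken from the original answer.

```python
from bs4 import BeautifulSoup

# Hypothetical sample data; only the id attribute matters here.
xml = """
<textblock id="Page1_Block1"></textblock>
<textblock id="Page1_Block2"></textblock>
<textblock id="Page2_Block1"></textblock>
<textblock id="Page10_Block1"></textblock>
"""

# html.parser is used so the sketch has no dependency on lxml (assumption).
soup = BeautifulSoup(xml, "html.parser")

# [id^="Page1_"] selects textblock tags whose id begins with "Page1_".
page1_blocks = soup.select('textblock[id^="Page1_"]')

print([tag["id"] for tag in page1_blocks])
```

Including the trailing underscore in the prefix keeps Page10_Block1 out of the Page1 results.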
