Using bs4 to find an HTML tag (h2) containing given text

Problem description

For this part of the HTML code:

html3= """<a name="definition"> </a>
<h2><span class="sectioncount">3.342.2323</span> Content Logical Definition <a title="link to here" class="self-link" href="valueset-investigation"><img src="ta.png"/></a></h2>
<hr/>
<div><p from the following </p><ul><li>Include these codes as defined in http://snomed.info/sct<table><tr><td><b>Code</b></td><td><b>Display</b></td></tr><tr><td>34353553</td><td>Examination / signs</td><td/></tr><tr><td>35453453453</td><td>History/symptoms</td><td/></tr></table></li></ul></div>
<p> </p>"""


I want to use BeautifulSoup to find the h2 whose text is "Content Logical Definition", together with its next siblings. But BeautifulSoup cannot find the h2. The following is my code:

soup = BeautifulSoup(html3, "lxml")
f= soup.find("h2", text = "Content Logical Definition").nextsibilings

This is the error:

AttributeError: 'NoneType' object has no attribute 'nextsibilings'

There are several h2 elements in the document, and the only thing that makes this h2 unique is the text "Content Logical Definition". After finding this h2, I want to extract the data from the table and the list under it.

Solution

The main problem is the way you are locating the h2 element whose siblings you want. I'd use a function instead, checking that "Content Logical Definition" is contained in the element's text:

soup.find(lambda elm: elm.name == "h2" and "Content Logical Definition" in elm.text)

Also, to get the next siblings you should use .next_siblings, not nextsibilings.
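
As a rough illustration of why the exact-text lookup comes back empty, the sketch below uses a simplified heading and the built-in "html.parser" (both are assumptions chosen to keep the example small, not the original page or parser). When a tag name is combined with `string=` (called `text=` in older Beautiful Soup releases), the value is compared against the tag's `.string`, and `.string` is None for a tag with more than one child, such as an h2 that holds a span, some text, and a link.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the heading in the question (an assumption, not the real page)
html = (
    '<h2><span class="sectioncount">3.1</span> Content Logical Definition '
    '<a href="#x">link</a></h2><hr/><div>body</div>'
)
soup = BeautifulSoup(html, "html.parser")

# Exact match against .string: the h2 has several children, so there is nothing to compare
exact = soup.find("h2", string="Content Logical Definition")
h2_string = soup.find("h2").string

# Substring match against the combined text of the element and its descendants
match = soup.find(
    lambda elm: elm.name == "h2" and "Content Logical Definition" in elm.get_text()
)

# Names of the nodes that follow the heading at the same level
sibling_names = [sibling.name for sibling in match.next_siblings]

print(exact, h2_string, sibling_names)
```

Because `find` returns None when nothing matches, chaining an attribute access directly onto it is what produces the AttributeError shown in the question.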

Demo:

>>> from bs4 import BeautifulSoup
>>> html3= """<a name="definition"> </a>
... <h2><span class="sectioncount">3.342.2323</span> Content Logical Definition <a title="link to here" class="self-link" href="valueset-investigation"><img src="ta.png"/></a></h2>
... <hr/>
... <div><p from the following </p><ul><li>Include these codes as defined in http://snomed.info/sct<table><tr><td><b>Code</b></td><td><b>Display</b></td></tr><tr><td>34353553</td><td>Examination / signs</td><td/></tr><tr><td>35453453453</td><td>History/symptoms</td><td/></tr></table></li></ul></div>
... <p> </p>"""
>>> soup = BeautifulSoup(html3, "lxml")
>>> h2 = soup.find(lambda elm: elm.name == "h2" and "Content Logical Definition" in elm.text)
>>> for sibling in h2.next_siblings:
...     print(sibling)
... 
<hr/>
<div><p following="" from="" the=""></p><ul><li>Include these codes as defined in http://snomed.info/sct<table><tr><td><b>Code</b></td><td><b>Display</b></td></tr><tr><td>34353553</td><td>Examination / signs</td><td></td></tr><tr><td>35453453453</td><td>History/symptoms</td><td></td></tr></table></li></ul></div>
<p> </p>
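
The loop in the demo also visits the whitespace between tags, because `.next_siblings` yields every following node, text included. A minimal sketch of the difference from `find_next_siblings()`, which returns only tags, is below; the tiny HTML snippet and the use of "html.parser" are assumptions made for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Newlines between the tags become text nodes in the parsed tree
html = "<div><h2>Title</h2>\n<hr/>\n<p>text</p></div>"
soup = BeautifulSoup(html, "html.parser")
h2 = soup.find("h2")

# Generator over all following nodes: text nodes and tags alike
all_nodes = list(h2.next_siblings)

# List of following tags only; text nodes are skipped
tags_only = h2.find_next_siblings()

# Filtering the generator by type gives the same tags
filtered = [node for node in all_nodes if isinstance(node, Tag)]

print(len(all_nodes), [tag.name for tag in tags_only])
```

If only elements matter, as when looking for the next table, `find_next_siblings()` avoids having to skip the text nodes by hand.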


That said, now that I know the real HTML you are dealing with and how messy it can be, I think you should iterate over the siblings and stop either at the next h2 or as soon as you find a table before it. Actual implementation:

import requests
from bs4 import BeautifulSoup

urls = [
    'https://www.hl7.org/fhir/valueset-activity-reason.html',
    'https://www.hl7.org/fhir/valueset-age-units.html'
]

for url in urls:
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'lxml')

    # Locate the heading by a substring of its text
    h2 = soup.find(lambda elm: elm.name == "h2" and "Content Logical Definition" in elm.text)
    table = None
    # Guard against pages where the heading is missing
    if h2 is not None:
        # Walk the following sibling tags: stop at the first table or at the next h2
        for sibling in h2.find_next_siblings():
            if sibling.name == "table":
                table = sibling
                break
            if sibling.name == "h2":
                break
    print(table)
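
The same idea can be packaged as a helper and tried without network access. The sketch below is one possible variant, not the answer's exact method: the function names `table_after_heading` and `table_rows` are hypothetical, the inline HTML is made up, and it additionally looks for a table nested inside a sibling (as in the sample from the question, where the table sits inside a div), which is an assumption about what is wanted.

```python
from bs4 import BeautifulSoup

def table_after_heading(soup, heading_text):
    """Return the first table between the matching h2 and the next h2, or None."""
    h2 = soup.find(lambda elm: elm.name == "h2" and heading_text in elm.get_text())
    if h2 is None:
        return None
    for sibling in h2.find_next_siblings():
        if sibling.name == "h2":
            break  # reached the next section without finding a table
        if sibling.name == "table":
            return sibling
        nested = sibling.find("table")  # table wrapped in a div, list, etc.
        if nested is not None:
            return nested
    return None

def table_rows(table):
    """Return the cell texts of a table as a list of rows."""
    return [
        [cell.get_text(strip=True) for cell in tr.find_all(["td", "th"])]
        for tr in table.find_all("tr")
    ]

# Made-up document with three sections (an assumption for illustration)
html = (
    "<h2><span>3.1</span> Content Logical Definition</h2><hr/>"
    "<div><ul><li><table>"
    "<tr><td><b>Code</b></td><td><b>Display</b></td></tr>"
    "<tr><td>34353553</td><td>Examination / signs</td></tr>"
    "</table></li></ul></div>"
    "<h2>Empty</h2><p>nothing here</p>"
    "<h2>Other</h2><table><tr><td>x</td></tr></table>"
)
soup = BeautifulSoup(html, "html.parser")

rows = table_rows(table_after_heading(soup, "Content Logical Definition"))
no_table = table_after_heading(soup, "Empty")
no_heading = table_after_heading(soup, "Missing heading")
print(rows)
```

Returning None both for a missing heading and for a section without a table keeps the caller's check simple, at the cost of not distinguishing the two cases.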
