Using bs4 to find an HTML tag (h2) that contains specific text

Views: 179 Published: 2018/6/21 14:26:33 Tags: python html beautifulsoup html-parsing bs4


Question

For this part of the HTML code:
html3= """<a name="definition"> </a>
<h2><span class="sectioncount">3.342.2323</span> Content Logical Definition <a title="link to here" class="self-link" href="valueset-investigation"><img src="ta.png"/></a></h2>
<hr/>
<div><p from the following </p><ul><li>Include these codes as defined in http://snomed.info/sct<table><tr><td><b>Code</b></td><td><b>Display</b></td></tr><tr><td>34353553</td><td>Examination / signs</td><td/></tr><tr><td>35453453453</td><td>History/symptoms</td><td/></tr></table></li></ul></div>
<p> </p>"""




I am going to use BeautifulSoup to find the h2 whose text equals "Content Logical Definition", together with its next siblings. But BeautifulSoup cannot find the h2. The following is my code:
soup = BeautifulSoup(html3, "lxml")
f= soup.find("h2", text = "Content Logical Definition").nextsibilings
This is the error:
AttributeError: 'NoneType' object has no attribute 'nextsibilings'
There are several h2 elements in the document, and the only thing that makes this h2 unique is the text "Content Logical Definition". After finding this h2, I am going to extract the data from the table and the list under it.
Solution
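The main problem is how you locate the h2 element whose siblings you want. Passing text="Content Logical Definition" compares against the tag's .string, which is None here because the h2 also contains a span and a link. I'd use a function instead that checks whether Content Logical Definition appears inside the element's text: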
soup.find(lambda elm: elm.name == "h2" and "Content Logical Definition" in elm.text)
Also, to get the next siblings you should use .next_siblings, not nextsibilings.
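To see why the original lookup returned None, the short sketch below compares the exact-string filter with the function filter on a reduced version of the heading. The trimmed HTML, the "#x" href, and the use of the built-in "html.parser" (chosen here only so lxml is not required) are assumptions for illustration; string= is the newer name of the text= argument.

```python
from bs4 import BeautifulSoup

# Reduced version of the heading from the question (illustrative only)
html = (
    '<h2><span class="sectioncount">3.342.2323</span> '
    'Content Logical Definition <a class="self-link" href="#x"></a></h2>'
    '<hr/><p>after</p>'
)
soup = BeautifulSoup(html, "html.parser")

h2 = soup.find("h2")

# The h2 has three children (span, text node, a), so .string is None
print(h2.string)

# An exact-string filter is compared against .string, so it matches nothing
exact = soup.find("h2", string="Content Logical Definition")
print(exact)

# A function filter can test the concatenated text of the element instead
match = soup.find(
    lambda elm: elm.name == "h2" and "Content Logical Definition" in elm.get_text()
)
print(match is h2)
```

The function filter works because get_text() joins the text of all descendants, so the surrounding span and link no longer get in the way.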

Demo:
>>> from bs4 import BeautifulSoup
>>> html3= """<a name="definition"> </a>
... <h2><span class="sectioncount">3.342.2323</span> Content Logical Definition <a title="link to here" class="self-link" href="valueset-investigation"><img src="ta.png"/></a></h2>
... <hr/>
... <div><p from the following </p><ul><li>Include these codes as defined in http://snomed.info/sct<table><tr><td><b>Code</b></td><td><b>Display</b></td></tr><tr><td>34353553</td><td>Examination / signs</td><td/></tr><tr><td>35453453453</td><td>History/symptoms</td><td/></tr></table></li></ul></div>
... <p> </p>"""
>>> soup = BeautifulSoup(html3, "lxml")
>>> h2 = soup.find(lambda elm: elm.name == "h2" and "Content Logical Definition" in elm.text)
>>> for sibling in h2.next_siblings:
...     print(sibling)
... 
<hr/>
<div><p following="" from="" the=""></p><ul><li>Include these codes as defined in http://snomed.info/sct<table><tr><td><b>Code</b></td><td><b>Display</b></td></tr><tr><td>34353553</td><td>Examination / signs</td><td></td></tr><tr><td>35453453453</td><td>History/symptoms</td><td></td></tr></table></li></ul></div>
<p> </p>
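One detail worth keeping in mind when iterating: .next_siblings yields every following node, including whitespace-only text nodes between tags, while find_next_siblings() returns tags only. A minimal sketch, assuming a tiny hand-written document and the built-in "html.parser":

```python
from bs4 import BeautifulSoup, NavigableString

# Tiny illustrative document with newlines between the tags
html = "<h2>Title</h2>\n<hr/>\n<div>body</div>"
soup = BeautifulSoup(html, "html.parser")
h2 = soup.find("h2")

# .next_siblings is a generator over all following nodes, text included
all_nodes = list(h2.next_siblings)
text_nodes = [n for n in all_nodes if isinstance(n, NavigableString)]

# find_next_siblings() skips text nodes and returns Tag objects only
tag_names = [t.name for t in h2.find_next_siblings()]

print(len(all_nodes), len(text_nodes), tag_names)
```

If only elements matter, find_next_siblings() avoids having to filter out the whitespace strings by hand.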




That said, now that I know the real HTML you are dealing with, and how messy it can be, I think you should iterate over the siblings and stop at the next h2, or earlier if you find a table before that. Actual implementation:
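import requests
from bs4 import BeautifulSoup

urls = [
    'https://www.hl7.org/fhir/valueset-activity-reason.html',
    'https://www.hl7.org/fhir/valueset-age-units.html'
]

for url in urls:
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'lxml')

    # Locate the h2 whose combined text contains the section title
    h2 = soup.find(lambda elm: elm.name == "h2" and "Content Logical Definition" in elm.text)
    if h2 is None:
        # No matching heading on this page, so there is nothing to search from
        print(None)
        continue

    # Walk the following sibling tags; stop at the first table or at the next h2
    table = None
    for sibling in h2.find_next_siblings():
        if sibling.name == "table":
            table = sibling
            break
        if sibling.name == "h2":
            break
    print(table)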
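Since the question goes on to extract the table data, here is a hedged sketch of that last step. The helper names find_table_after_heading and table_to_rows are hypothetical, the sample HTML is a cut-down stand-in for the real page, and the extra sibling.find("table") check is an assumption added for cases like the question's snippet, where the table is nested inside a div rather than being a direct sibling of the h2.

```python
from bs4 import BeautifulSoup


def find_table_after_heading(soup, heading_text):
    """Return the first table between the matching h2 and the next h2, or None."""
    h2 = soup.find(lambda elm: elm.name == "h2" and heading_text in elm.get_text())
    if h2 is None:
        return None
    for sibling in h2.find_next_siblings():
        if sibling.name == "h2":
            return None  # reached the next section without finding a table
        if sibling.name == "table":
            return sibling
        nested = sibling.find("table")  # table wrapped in div/ul/li
        if nested is not None:
            return nested
    return None


def table_to_rows(table):
    """Convert a table tag into a list of rows, each a list of cell strings."""
    return [
        [cell.get_text(strip=True) for cell in tr.find_all(["td", "th"])]
        for tr in table.find_all("tr")
    ]


# Cut-down sample modelled on the question's HTML (illustrative only)
html = """<h2><span>3.1</span> Content Logical Definition</h2>
<hr/>
<div><ul><li>Codes<table>
<tr><td><b>Code</b></td><td><b>Display</b></td></tr>
<tr><td>34353553</td><td>Examination / signs</td></tr>
</table></li></ul></div>
<h2>Other section</h2>"""

soup = BeautifulSoup(html, "html.parser")
table = find_table_after_heading(soup, "Content Logical Definition")
rows = table_to_rows(table)
print(rows)
```

Returning None in every failure case keeps the caller's check simple: a single "if table is None" covers a missing heading as well as a section without a table.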


                        