Access a specific table in an HTML tag


Question

I want to use BeautifulSoup to find the table in the "Content Logical Definition" section of each of the following pages:

1) https://www.hl7.org/fhir/valueset-account-status.html
2) https://www.hl7.org/fhir/valueset-activity-reason.html
3) https://www.hl7.org/fhir/valueset-age-units.html 

A page may define several tables. The table I want is located under the <h2> tag with the text "Content Logical Definition". Some pages have no table in that section, in which case I want the result to be null. So far I have tried several solutions, but each of them returns the wrong table for some of the pages.

The last solution, offered by alecxe, is this:

import requests
from bs4 import BeautifulSoup

urls = [
    'https://www.hl7.org/fhir/valueset-activity-reason.html',
    'https://www.hl7.org/fhir/valueset-age-units.html'
]

for url in urls:
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'lxml')

    h2 = soup.find(lambda elm: elm.name == "h2" and "Content Logical Definition" in elm.text)
    table = None
    for sibling in h2.find_next_siblings():
        if sibling.name == "table":
            table = sibling
            break
        if sibling.name == "h2":
            break
    print(table)

This solution returns null when the "Content Logical Definition" section contains no table. But for the second URL, which does have a table in that section, it returns the wrong table: one at the end of the page.

How can I edit this code so that it returns the table defined directly after the tag with the text "Content Logical Definition", and returns null if that section contains no table?

Recommended answer

It looks like the problem with alecxe's code is that it only finds a table that is a direct sibling of the h2, but the one you want is actually inside a div (which is the h2's sibling). This worked for me:

import requests
from bs4 import BeautifulSoup

urls = [
    'https://www.hl7.org/fhir/valueset-account-status.html',
    'https://www.hl7.org/fhir/valueset-activity-reason.html',
    'https://www.hl7.org/fhir/valueset-age-units.html'
]


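def extract_table(url):
    """Return the first table in the div that follows the section heading, or None."""
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'lxml')

    # Locate the <h2> that opens the "Content Logical Definition" section.
    h2 = soup.find(lambda elm: elm.name == 'h2' and 'Content Logical Definition' in elm.text)
    if h2 is None:
        return None

    # The table is nested inside the next sibling <div>, not a direct sibling of the <h2>.
    div = h2.find_next_sibling('div')
    return div.find('table') if div is not None else None


for url in urls:
    print(extract_table(url))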
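
One caveat with the approach above: find_next_sibling('div') does not stop at the next h2, so on a page whose section has no div it could pick up a table from a later section. A possible way to combine both ideas is sketched below: walk the siblings until the next h2, and look for a table either as a sibling or nested inside one. The function name find_section_table, the heading_text parameter, and the two inline HTML samples are made up for illustration and are not taken from the real HL7 pages; the built-in html.parser is used so the sketch runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature pages, only loosely modelled on the layout described above.
SAMPLE_WITH_TABLE = """
<h2>Content Logical Definition</h2>
<div><table id="wanted"><tr><td>code</td></tr></table></div>
<h2>Expansion</h2>
<table id="other"><tr><td>x</td></tr></table>
"""

SAMPLE_WITHOUT_TABLE = """
<h2>Content Logical Definition</h2>
<p>No table here.</p>
<h2>Expansion</h2>
<div><table id="other"><tr><td>x</td></tr></table></div>
"""


def find_section_table(html, heading_text="Content Logical Definition"):
    """Return the first table between the matching <h2> and the next <h2>, or None."""
    soup = BeautifulSoup(html, "html.parser")
    h2 = soup.find(lambda elm: elm.name == "h2" and heading_text in elm.get_text())
    if h2 is None:
        return None

    for sibling in h2.find_next_siblings():
        if sibling.name == "h2":
            # The next section starts here, so this section has no table.
            break
        if sibling.name == "table":
            return sibling
        # Also accept a table nested inside the sibling (for example in a div).
        nested = sibling.find("table")
        if nested is not None:
            return nested
    return None


if __name__ == "__main__":
    print(find_section_table(SAMPLE_WITH_TABLE))
    print(find_section_table(SAMPLE_WITHOUT_TABLE))
```

To use it on the live pages, the same function could be called with the body of requests.get(url), assuming those pages keep the heading and the table as siblings (or descendants of siblings) under a common parent.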
