Accessing the contents of links provided on a webpage while web scraping

Problem description

This is a follow-up to my previous question. I am trying to access the contents of a webpage.

I can search for content on the webpage. However, I am not sure how to access the content behind the links given on the webpage.

For instance, the first line of the search result for id 1.1.1.1 is 36EUL/ADL_7 1.1.1.1 spectrophotometry .... C ....

The secondary id 36EUL/ADL_7 in the first line has another link that opens when clicked.

I am not sure how to access the contents of the search result for the secondary id.

Any suggestions?

The solution posted by Sers works for search_term = 1.1.1.1 with the following output format (same as the output obtained):

EC Number: 1.1.1.1, Reference Id: 36EUL/ADL_7, Evaluation: C
T(K): 298.15, pH: 6.4, K': 1.3E-5
T(K): 298.15, pH: 7.0, K': 5.3E-5
T(K): 298.15, pH: 7.7, K': 1.3E-4

However, for a different search term, i.e. search_term = 2.7.2.3:

Output obtained (this fails because the output table in the database has four columns, excluding the reference id):

EC Number: 2.7.2.3, Reference Id: 95SCH/TRA_581, Evaluation: C
T(K): 277.15, pH: 7.5, K': ethylene glycol, 40 %
T(K): 277.15, pH: 7.5, K': none

Expected output:

EC Number: 2.7.2.3, Reference Id: 95SCH/TRA_581, Evaluation: C
T(K): 277.15, pH: 7.5, cosolvent: ethylene glycol, 40 %, K': 8.0E-5
T(K): 277.15, pH: 7.5, cosolvent: none, K': 1.5E-4

Lines 85-87 do not always give the correct assignment:

tk_list = page.select("#MainBody_extraData td:nth-child(1)")
ph_list = page.select("#MainBody_extraData td:nth-child(2)")
k_list = page.select("#MainBody_extraData td:nth-child(3)")

My suggestion: can we map each column name to its corresponding value while parsing the values from the columns? For example, for the table below (see the sketch after it):

<table bgcolor="White" bordercolor="White" cellpadding="3" cellspacing="1" id="MainBody_extraData" width="100%">
<tr bgcolor="#4A3C8C">
<th scope="col"><font color="#E7E7FF"><b>T(K)</b></font></th><th scope="col"><font color="#E7E7FF"><b>pH </b></font></th><th scope="col"><font color="#E7E7FF"><b>cosolvent </b></font></th><th scope="col"><font color="#E7E7FF"><b>K' </b></font></th><th scope="col"><font color="#E7E7FF"><b>95SCH/TRA_581</b></font></th>
</tr><tr bgcolor="#DEDFDE">
<td><font color="Black">277.15</font></td><td><font color="Black">7.5</font></td><td><font color="Black">ethylene glycol, 40 %</font></td><td><font color="Black">8.0E-5</font></td><td><font color="Black">95SCH/TRA_581</font></td>
</tr><tr bgcolor="White">
<td><font color="Black">277.15</font></td><td><font color="Black">7.5</font></td><td><font color="Black">none</font></td><td><font color="Black">1.5E-4</font></td><td><font color="Black">95SCH/TRA_581</font></td>
</tr>
</table>
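One way to do that mapping (a minimal sketch, assuming the table markup above is available in a local html string; a trimmed copy is inlined here):

from bs4 import BeautifulSoup

# Trimmed copy of the #MainBody_extraData markup shown above.
html = """
<table id="MainBody_extraData">
<tr><th>T(K)</th><th>pH </th><th>cosolvent </th><th>K' </th><th>95SCH/TRA_581</th></tr>
<tr><td>277.15</td><td>7.5</td><td>ethylene glycol, 40 %</td><td>8.0E-5</td><td>95SCH/TRA_581</td></tr>
<tr><td>277.15</td><td>7.5</td><td>none</td><td>1.5E-4</td><td>95SCH/TRA_581</td></tr>
</table>
"""

page = BeautifulSoup(html, "html.parser")

# Column names come from the header row, so the pairing survives extra columns such as cosolvent.
headers = [th.get_text(strip=True) for th in page.select("#MainBody_extraData th")]

rows = []
for tr in page.select("#MainBody_extraData tr")[1:]:  # skip the header row
    cells = [td.get_text(strip=True) for td in tr.select("td")]
    rows.append(dict(zip(headers, cells)))

for row in rows:
    print(row)
# {'T(K)': '277.15', 'pH': '7.5', 'cosolvent': 'ethylene glycol, 40 %', "K'": '8.0E-5', '95SCH/TRA_581': '95SCH/TRA_581'}
# {'T(K)': '277.15', 'pH': '7.5', 'cosolvent': 'none', "K'": '1.5E-4', '95SCH/TRA_581': '95SCH/TRA_581'}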

Answer

All of this can be done with Requests and BeautifulSoup, without Selenium. Here is the code to get the data together with the details:

import requests
from bs4 import BeautifulSoup

base_url = 'https://randr.nist.gov'
ec_name = 'enzyme'
search_term = '1.1.1.1'

url = f'{base_url}/{ec_name}/'

with requests.Session() as session:
    # get __VIEWSTATE, __VIEWSTATEGENERATOR, __EVENTVALIDATION parameters to use them in POST parameters
    response = session.get(url)
    page = BeautifulSoup(response.text, "html.parser")
    view_state = page.find(id="__VIEWSTATE")["value"]
    view_state_generator = page.find(id="__VIEWSTATEGENERATOR")["value"]
    event_validation = page.find(id="__EVENTVALIDATION")["value"]

    data = {
        '__EVENTTARGET': '',
        '__EVENTARGUMENT': '',
        '__LASTFOCUS': '',
        '__VIEWSTATE': view_state,
        '__VIEWSTATEGENERATOR': view_state_generator,
        '__SCROLLPOSITIONX': '0',
        '__SCROLLPOSITIONY': '0',
        '__EVENTVALIDATION': event_validation,
        'ctl00$MainBody$txtSrchAutoFill': search_term,
        'ctl00$MainBody$repoList': 'Enzyme_thermo',
        'ctl00$MainBody$ImgSrch.x': '0',
        'ctl00$MainBody$ImgSrch.y': '0'
    }
    response = session.post(url, data=data)
    page = BeautifulSoup(response.text, "html.parser")

    # get all rows
    rows = page.select("#MainBody_gvSearch tr")
    # first row is header, remove it
    rows.remove(rows[0])

    for row in rows:
        reference_id = row.select_one("[id*='lbSearch']").text.strip()
        ec_number = row.select_one("[id*='lblECNumber']").text.strip()
        method = row.select_one("[id*='lblMethod']").text.strip()
        buffer = row.select_one("[id*='lblBuffer']").text.strip()
        reaction = row.select_one("[id*='lblReaction']").text.strip()
        enzyme = row.select_one("[id*='lblEnzyme']").text.strip()
        cofactor = row.select_one("[id*='lblCofactor']").text.strip()
        evaluation = row.select_one("[id*='lblEvaluation']").text.strip()

        print(f"EC Number: {ec_number}, Reference Id: {reference_id}, Evaluation: {evaluation}")

        # get details
        params = (
            ('ID', reference_id),
            ('finalterm', search_term),
            ('data', ec_name),
        )
        response = session.get('https://randr.nist.gov/enzyme/DataDetails.aspx', params=params)
        page = BeautifulSoup(response.text, "html.parser")

        # parse general information
        if page.find("span", text='Reference:'):
            reference = page.find("span", text='Reference:').find_parent("td").find_next_sibling("td").text.strip()
        if page.find("span", text='pH:'):
            ph = page.find("span", text='pH:').find_parent("td").find_next_sibling("td").text.strip()

        # parse table
        extra_data = []
        try:
            table_headers = [x.text.strip() for x in page.select("#MainBody_extraData th")]
            table_data = [x.text.strip() for x in page.select("#MainBody_extraData td")]

            headers_count = len(table_headers)
            for i in range(0, len(table_data), headers_count):
                row = {}
                row_data = table_data[i:i + headers_count]
                for column_index, h in enumerate(table_headers):
                    row[h] = row_data[column_index]

                print("T(K): {}, pH: {}, K': {}".format(row["T(K)"], row["pH"], row["K'"]))
                extra_data.append(row)

        except Exception as ex:
            print("No details table found")
            print(ex)

        print("")

Output with some values:

EC Number: 1.1.1.1, Reference Id: 36EUL/ADL_7, Evaluation: C
T(K): 298.15, pH: 6.4, K': 1.3E-5
T(K): 298.15, pH: 7.0, K': 5.3E-5
T(K): 298.15, pH: 7.7, K': 1.3E-4

EC Number: 1.1.1.1, Reference Id: 37ADL/SRE_8, Evaluation: D
T(K): 298.15, pH: 6.05, K': 6.0E-6
T(K): 298.15, pH: 7.25, K': 7.7E-5
T(K): 298.15, pH: 8.0, K': 1.2E-5

EC Number: 1.1.1.1, Reference Id: 37NEG/WUL_9, Evaluation: C
T(K): 293.15, pH: 7.9, K': 7.41E-4

EC Number: 1.1.1.1, Reference Id: 38SCH/HEL_10, Evaluation: C
T(K): 298.15, pH: 6.30, K': 2.6E-5
T(K): 298.15, pH: 6.85, K': 8.8E-5
T(K): 298.15, pH: 7.15, K': 1.9E-4
T(K): 298.15, pH: 7.34, K': 3.0E-4
T(K): 298.15, pH: 7.61, K': 5.1E-4
T(K): 298.15, pH: 7.77, K': 8.0E-4
T(K): 298.15, pH: 8.17, K': 2.2E-3

EC Number: 1.1.1.1, Reference Id: 38SCH/HEL_23, Evaluation: C
T(K): 298.15, pH: 6.39, K': 9.1E-6
T(K): 298.15, pH: 6.60, K': 3.0E-5
T(K): 298.15, pH: 6.85, K': 5.1E-5
T(K): 298.15, pH: 7.18, K': 1.5E-4
T(K): 298.15, pH: 7.31, K': 2.3E-4
T(K): 298.15, pH: 7.69, K': 5.6E-4
T(K): 298.15, pH: 8.06, K': 1.1E-3
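For a search term like 2.7.2.3, whose details table carries an extra cosolvent column, the row dictionary built in the loop above already holds every column keyed by its header, so the hardcoded T(K)/pH/K' print can be swapped for one that emits whichever columns are present. A small sketch with a hypothetical helper, assuming (as in the markup shown earlier) that the last table header simply repeats the reference id:

def format_detail_row(row, table_headers, reference_id):
    # Join each column as "header: value", skipping the trailing column that repeats the reference id.
    return ", ".join(f"{h}: {row[h]}" for h in table_headers if h != reference_id)

# Inside the loop, replace the fixed-format print with:
#     print(format_detail_row(row, table_headers, reference_id))
# which would give, for example:
#     T(K): 277.15, pH: 7.5, cosolvent: ethylene glycol, 40 %, K': 8.0E-5   (search_term = 2.7.2.3)
#     T(K): 298.15, pH: 6.4, K': 1.3E-5                                     (search_term = 1.1.1.1)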
