Accessing Hidden Tabs, Web Scraping With Python 3.6

Question

I'm using bs4 and urllib.request in Python 3.6 to scrape a web page. To reach the div tags I need, I have to open the tabs, that is, toggle the "aria-expanded" attribute on their button tags.

When the tab is closed, the button tag is as follows (shown without the surrounding <>):

button id="0-accordion-tab-0" type="button" class="accordion-panel-title u-padding-ver-s u-text-left text-l js-accordion-panel-title" aria-controls="0-accordion-panel-0" aria-expanded="false"

When the tab is opened, aria-expanded becomes "true" and the div tag appears underneath.

Any idea on how to do this?

Help would be super appreciated.

Recommended answer

From your other post, I'm guessing the URL is https://www.sciencedirect.com/journal/construction-and-building-materials/issues

The web page loads JSON from another URL when you click the link. You can request that JSON yourself without clicking anything. All you need is the journal's ISSN, which never changes (09500618), and the year, which you can pass in from a range. This also returns the data for tabs that are already expanded.

import json

import requests

# The site rejects requests from user agents it has blacklisted,
# so send a browser-like User-Agent header.
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/60.0'
}

# 09500618 is the journal's ISSN; range(1999, 2019) covers 1999 through 2018.
for year in range(1999, 2019):
    url = f"https://www.sciencedirect.com/journal/09500618/year/{year}/issues"
    r = requests.get(url, headers=headers)
    j = r.json()

    for d in j['data']:
        # Print the whole JSON object
        print(json.dumps(d, indent=4, sort_keys=True))
        # Or print specific values
        print(d['coverDateText'], d['volumeFirst'], d['uriLookup'], d['srctitle'])

Output:

{
    "cid": "271475",
    "contentFamily": "serial",
    "contentType": "JL",
    "coverDateStart": "19991201",
    "coverDateText": "1 December 1999",
    "hubStage": "H300",
    "issn": "09500618",
    "issueFirst": "8",
    "pages": [
        {
            "firstPage": "417",
            "lastPage": "470"
        }
    ],
    "pii": "S0950061800X00323",
    "sortField": "1999001300008zzzzzzz",
    "srctitle": "Construction and Building Materials",
    "uriLookup": "/vol/13/issue/8",
    "volIssueSupplementText": "Volume 13, Issue 8",
    "volumeFirst": "13"
}
1 December 1999 13 /vol/13/issue/8 Construction and Building Materials
...
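
Every value in the record above is a string, so it can help to convert the ones you need before working with them. The following is only a sketch built on the single record shown in the output. The helper name summarize_issue is hypothetical, and it assumes that volume, issue and page numbers are plain integers, which may not hold for every issue the endpoint returns.

```python
from datetime import datetime

# Fields copied from the sample record printed in the output above.
SAMPLE_ISSUE = {
    "coverDateStart": "19991201",
    "coverDateText": "1 December 1999",
    "issn": "09500618",
    "issueFirst": "8",
    "pages": [{"firstPage": "417", "lastPage": "470"}],
    "srctitle": "Construction and Building Materials",
    "uriLookup": "/vol/13/issue/8",
    "volumeFirst": "13",
}


def summarize_issue(issue):
    """Convert the string fields of one issue record into typed values.

    Hypothetical helper: assumes volume and issue are plain integers.
    """
    cover_date = datetime.strptime(issue["coverDateStart"], "%Y%m%d").date()

    # Add up the pages of every range; skip ranges that are not plain integers.
    page_count = 0
    for page_range in issue.get("pages", []):
        try:
            first = int(page_range["firstPage"])
            last = int(page_range["lastPage"])
        except (KeyError, TypeError, ValueError):
            continue
        page_count += last - first + 1

    return {
        "title": issue["srctitle"],
        "volume": int(issue["volumeFirst"]),
        "issue": int(issue["issueFirst"]),
        "cover_date": cover_date,
        "page_count": page_count,
    }


summary = summarize_issue(SAMPLE_ISSUE)
print(summary["cover_date"].isoformat(), summary["volume"],
      summary["issue"], summary["page_count"])
# prints: 1999-12-01 13 8 54
```

In the loop from the answer, the same helper could be called on each d in j['data'] in place of the second print.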

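
Since the question uses urllib.request rather than requests, the same idea can be sketched with the standard library alone. The function names (issues_url, parse_issues, fetch_issues, main) are made up for this illustration. The URL pattern, the User-Agent header and the 'data' key are taken from the answer, and the sketch assumes the endpoint still responds the way it did when the answer was written.

```python
import json
import urllib.request

# URL pattern and header taken from the answer; everything else is illustrative.
URL_TEMPLATE = "https://www.sciencedirect.com/journal/{issn}/year/{year}/issues"
HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:60.0) "
                  "Gecko/20100101 Firefox/60.0"
}


def issues_url(issn, year):
    """Build the JSON URL for one journal (by ISSN) and one year."""
    return URL_TEMPLATE.format(issn=issn, year=year)


def parse_issues(raw_bytes):
    """Decode a response body and return the list stored under 'data'."""
    return json.loads(raw_bytes.decode("utf-8"))["data"]


def fetch_issues(issn, year, timeout=10):
    """Request one year's issues; needs network access to the site."""
    request = urllib.request.Request(issues_url(issn, year), headers=HEADERS)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return parse_issues(response.read())


def main():
    # Same ISSN and year range as the answer: 1999 through 2018.
    for year in range(1999, 2019):
        for issue in fetch_issues("09500618", year):
            print(issue["coverDateText"], issue["volumeFirst"], issue["uriLookup"])
```

Nothing is requested until main() is called. Keeping URL building and JSON parsing in their own functions means both can be checked without touching the network.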