Crawling csv files from a url with dropdown list?


Problem Description

I am trying to crawl monthly data (csv files) from Weather Canada.

Normally one needs to select the year/month/day from the dropdown lists, click "GO", and then click the "Download Data" button to get the data for the selected month and year. I'd like to download the data files in CSV for every available month/year in Python (with Beautiful Soup 4).

I tried to modify some code from another question here, but haven't been successful. Please help.

from bs4 import BeautifulSoup
# Python 3.x
from urllib.request import urlopen, urlretrieve

# Removed the trailing / from the URL
urlJan2020 = '''https://climate.weather.gc.ca/climate_data/hourly_data_e.html?hlyRange=2004-09-24%7C2020-03-03&dlyRange=2018-05-14%7C2020-03-03&mlyRange=%7C&StationID=43403&Prov=NS&urlExtension=_e.html&searchType=stnProx&optLimit=yearRange&StartYear=1840&EndYear=2020&selRowPerPage=25&Line=0&txtRadius=50&optProxType=city&selCity=44%7C40%7C63%7C36%7CHalifax&selPark=&txtCentralLatDeg=&txtCentralLatMin=0&txtCentralLatSec=0&txtCentralLongDeg=&txtCentralLongMin=0&txtCentralLongSec=0&txtLatDecDeg=&txtLongDecDeg=&timeframe=1&Year=2020&Month=1&Day=1#'''
u = urlopen(urlJan2020)
try:
    html = u.read().decode('utf-8')
finally:
    u.close()

soup = BeautifulSoup(html, "html.parser")

# Select all A elements that have an href attribute, starting with http://
for link in soup.select('a[href^="http://"]'):
    href = link.get('href')
    if not any(href.endswith(x) for x in ['.csv','.xls','.xlsx']):
        continue

    filename = href.rsplit('/', 1)[-1]

    # You don't need to join + quote as URLs in the HTML are absolute.
    # However, we need a https:// URL (in spite of what the link says: check request in your web browser's developer tools)
    href = href.replace('http://','https://')

    print("Downloading %s to %s..." % (href, filename) )
    urlretrieve(href, filename)
    print("Done.")

Solution

import requests


def Main():
    with requests.Session() as req:
        # The "Download Data" button submits to the bulk_data_e.html endpoint;
        # timeframe=1 requests hourly data, so each request returns a full month.
        for year in range(2019, 2021):
            for month in range(1, 13):
                r = req.post(
                    f"https://climate.weather.gc.ca/climate_data/bulk_data_e.html?format=csv&stationID=43403&Year={year}&Month={month}&Day=1&timeframe=1&submit=Download+Data")
                # The Content-Disposition header looks like
                # attachment; filename="en_climate_hourly_NS_<id>_01-2019_P1H.csv";
                # split("_", 5)[-1] keeps everything after the fifth underscore
                # and [:-1] strips the closing quote, e.g. 01-2019_P1H.csv.
                name = r.headers.get(
                    "Content-Disposition").split("_", 5)[-1][:-1]
                with open(name, 'w') as f:
                    f.write(r.text)
                print(f"Saved {name}")


Main()
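
Each iteration writes one monthly CSV into the working directory. As a follow-up, a minimal sketch for combining the downloads into a single table, assuming pandas is installed, the directory contains only these files, and each file's first row is the column header:

import glob

import pandas as pd

# Stack the downloaded monthly files into one DataFrame for analysis.
frames = [pd.read_csv(path) for path in sorted(glob.glob("*.csv"))]
hourly = pd.concat(frames, ignore_index=True)
print(hourly.shape)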
