Downloading files by crawling sub-URLs in Python
Question
I am trying to download documents (mainly in pdf) from a large number of web links like the following:
https://projects.worldbank.org/en/projects-operations/document-detail/P167897?type=projects
https://projects.worldbank.org/en/projects-operations/document-detail/P173997?type=projects
https://projects.worldbank.org/en/projects-operations/document-detail/P166309?type=projects
However, the pdf files are not directly accessible from these links. One needs to click on sub-URLs to access the pdfs. Is there any way to crawl the sub-URLs and download all the related files from them? I am trying with the following code but have not had any success so far, specifically for the URLs listed here.
Please let me know if you need any further clarifications. I would be happy to do so. Thank you.
from simplified_scrapy import Spider, SimplifiedDoc, SimplifiedMain, utils

class MySpider(Spider):
    name = 'download_pdf'
    allowed_domains = ["www.worldbank.org"]
    start_urls = [
        "https://projects.worldbank.org/en/projects-operations/document-detail/P167897?type=projects",
        "https://projects.worldbank.org/en/projects-operations/document-detail/P173997?type=projects",
        "https://projects.worldbank.org/en/projects-operations/document-detail/P166309?type=projects"
    ]  # Entry page

    def afterResponse(self, response, url, error=None, extra=None):
        if not extra:
            print("The version of library simplified_scrapy is too old, please update.")
            SimplifiedMain.setRunFlag(False)
            return
        try:
            path = './pdfs'
            # create folder start
            srcUrl = extra.get('srcUrl')
            if srcUrl:
                index = srcUrl.find('year/')
                year = ''
                if index > 0:
                    year = srcUrl[index + 5:]
                    index = year.find('?')
                    if index > 0:
                        path = path + year[:index]
            utils.createDir(path)
            # create folder end
            path = path + url[url.rindex('/'):]
            index = path.find('?')
            if index > 0: path = path[:index]
            flag = utils.saveResponseAsFile(response, path, fileType="pdf")
            if flag:
                return None
            else:  # If it's not a pdf, leave it to the frame
                return Spider.afterResponse(self, response, url, error, extra)
        except Exception as err:
            print(err)

    def extract(self, url, html, models, modelNames):
        doc = SimplifiedDoc(html)
        lst = doc.selects('div.list >a').contains("documents/", attr="href")
        if not lst:
            lst = doc.selects('div.hidden-md hidden-lg >a')
        urls = []
        for a in lst:
            a["url"] = utils.absoluteUrl(url.url, a["href"])
            # Set root url start
            a["srcUrl"] = url.get('srcUrl')
            if not a['srcUrl']:
                a["srcUrl"] = url.url
            # Set root url end
            urls.append(a)
        return {"Urls": urls}

    # Download again by resetting the URL. Called when you want to download again.
    def resetUrl(self):
        Spider.clearUrl(self)
        Spider.resetUrlsTest(self)

SimplifiedMain.startThread(MySpider())  # Start download
Answer
There's an API endpoint that contains the entire response you see on the web-site along with... the URL to the document pdf. :D
So, you can query the API, get the URLs, and finally fetch the documents.
Here's how:
import requests

pids = ["P167897", "P173997", "P166309"]

for pid in pids:
    end_point = f"https://search.worldbank.org/api/v2/wds?" \
                f"format=json&includepublicdocs=1&" \
                f"fl=docna,lang,docty,repnb,docdt,doc_authr,available_in&" \
                f"os=0&rows=20&proid={pid}&apilang=en"
    documents = requests.get(end_point).json()["documents"]
    for document_data in documents.values():
        try:
            pdf_url = document_data["pdfurl"]
            print(f"Fetching: {pdf_url}")
            with open(pdf_url.rsplit("/")[-1], "wb") as pdf:
                pdf.write(requests.get(pdf_url).content)
        except KeyError:
            continue
Output: (fully downloaded .pdf files)
Fetching: http://documents.worldbank.org/curated/en/106981614570591392/pdf/Official-Documents-Grant-Agreement-for-Additional-Financing-Grant-TF0B4694.pdf
Fetching: http://documents.worldbank.org/curated/en/331341614570579132/pdf/Official-Documents-First-Restatement-to-the-Disbursement-Letter-for-Grant-D6810-SL-and-for-Additional-Financing-Grant-TF0B4694.pdf
Fetching: http://documents.worldbank.org/curated/en/387211614570564353/pdf/Official-Documents-Amendment-to-the-Financing-Agreement-for-Grant-D6810-SL.pdf
Fetching: http://documents.worldbank.org/curated/en/799541612993594209/pdf/Sierra-Leone-AFRICA-WEST-P167897-Sierra-Leone-Free-Education-Project-Procurement-Plan.pdf
Fetching: http://documents.worldbank.org/curated/en/310641612199201329/pdf/Disclosable-Version-of-the-ISR-Sierra-Leone-Free-Education-Project-P167897-Sequence-No-02.pdf
and more ...
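One caveat: `rows=20` caps each response, so a project with more documents needs paging via the `os` offset. The sketch below shows one way to do that, assuming the response JSON also carries a `total` result count (a field name worth verifying against a live response before relying on it); the function name and the injectable `fetch` parameter are mine, added so the paging logic can be tested without network access:

```python
import json
from urllib.request import urlopen

def _http_json(url):
    """Default fetcher: GET the URL and decode the JSON body."""
    with urlopen(url, timeout=30) as resp:
        return json.load(resp)

def fetch_all_documents(pid, rows=20, fetch=_http_json):
    """Page through the wds endpoint, merging the `documents` dicts.

    Assumes each page's JSON has a `documents` mapping and a `total`
    count (check one live response first); `fetch` is injectable for tests.
    """
    docs = {}
    offset = 0
    while True:
        url = ("https://search.worldbank.org/api/v2/wds?format=json"
               f"&includepublicdocs=1&os={offset}&rows={rows}"
               f"&proid={pid}&apilang=en")
        data = fetch(url)
        docs.update(data.get("documents", {}))
        total = int(data.get("total", 0))
        offset += rows
        if offset >= total:
            return docs
```

The merged dict can then be fed to the same `pdfurl` download loop as in the answer above.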