Using Scrapy: how to download PDF files from some extracted links
Question
I've created code that extracts some links from a website (PDF links), and now I need to download these PDF files, but I am struggling with how to do that. This is the code:
import scrapy


class all5(scrapy.Spider):
    name = "all5"

    start_urls = [
        'https://www.alloschool.com/course/alriadhiat-alaol-ibtdaii',
    ]

    def parse(self, response):
        for link in response.css('.default .er').xpath('@href').extract():
            url = response.url
            path = response.css('ol.breadcrumb li a::text').extract()
            next_link = response.urljoin(link)
            yield scrapy.Request(next_link, callback=self.parse_det,
                                 meta={'url': url, 'path': path})

    def parse_det(self, response):
        def extract_with_css(query):
            return response.css(query).get(default='').strip()

        yield {
            'path': response.meta['path'],
            'finallink': extract_with_css('a.btn.btn-primary::attr(href)'),
            'url': response.meta['url'],
        }
The links that I need to download are in "finallink".
How can I solve this problem?
Answer
In settings you have to activate the pipeline:
ITEM_PIPELINES = {'scrapy.pipelines.files.FilesPipeline': 1}
and set the folder for downloaded files:

FILES_STORE = '.'

It will download files to FILES_STORE/full.
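The unique filenames in that folder come from hashing: by default, FilesPipeline names each file after the SHA-1 hash of its URL plus the extension. A minimal sketch of that naming rule, using a made-up URL:

```python
import hashlib

# FilesPipeline's default naming: SHA-1 hash of the file URL, plus the
# extension, stored under the `full/` subfolder of FILES_STORE.
# The URL below is a made-up example.
url = "https://example.com/some-document.pdf"
filename = "full/" + hashlib.sha1(url.encode("utf-8")).hexdigest() + ".pdf"
print(filename)
```

Because the name is derived from the URL, re-running the spider will not duplicate files that were already downloaded.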
And when you yield data, you have to use the field name file_urls:
yield {
    'file_urls': [extract_with_css('a.btn.btn-primary::attr(href)')],
    # ... rest ...
}
It has to be a list even if you have only one file to download.
It should download the PDFs to files with unique names, which you get back in the data in the field files.
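For illustration, this is roughly the shape of an item after the pipeline runs; the field names follow the Scrapy docs, but every concrete value below is invented:

```python
# A sketch of an item after FilesPipeline has processed it.
# All concrete values are invented for illustration.
item = {
    'file_urls': ['https://example.com/doc.pdf'],
    'files': [{
        'url': 'https://example.com/doc.pdf',
        'path': 'full/0123456789abcdef.pdf',  # relative to FILES_STORE
        'checksum': '0123456789abcdef',       # hash of the downloaded content
        'status': 'downloaded',
    }],
}
print(item['files'][0]['path'])
```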
Scrapy docs: Downloading and processing files and images
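If the hashed names are inconvenient, those docs also describe overriding FilesPipeline's file_path() in a subclass to choose your own names. A hypothetical helper showing the path such an override could return (pdf_file_path is my own name, not a Scrapy API):

```python
import os
from urllib.parse import urlparse

# Hypothetical helper: keep the original filename from the URL instead of
# the SHA-1 hash. To use it for real, subclass
# scrapy.pipelines.files.FilesPipeline and return this value from its
# file_path() method.
def pdf_file_path(url):
    return 'full/' + os.path.basename(urlparse(url).path)

print(pdf_file_path('https://www.alloschool.com/assets/documents/example.pdf'))
```

Note that with original filenames, two different URLs ending in the same name would overwrite each other, which the default hash-based naming avoids.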
Standalone code - you can copy and run it without creating a project:
#!/usr/bin/env python3

import scrapy


class MySpider(scrapy.Spider):
    name = 'myspider'

    start_urls = [
        'https://www.alloschool.com/course/alriadhiat-alaol-ibtdaii',
    ]

    def parse(self, response):
        for link in response.css('.default .er').xpath('@href').extract():
            url = response.url
            path = response.css('ol.breadcrumb li a::text').extract()
            next_link = response.urljoin(link)
            yield scrapy.Request(next_link, callback=self.parse_det,
                                 meta={'url': url, 'path': path})

    def parse_det(self, response):
        def extract_with_css(query):
            return response.css(query).get(default='').strip()

        yield {
            'path': response.meta['path'],
            'file_urls': [extract_with_css('a.btn.btn-primary::attr(href)')],
            'url': response.meta['url'],
        }


from scrapy.crawler import CrawlerProcess

c = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',

    # save scraped items in a file as CSV, JSON or XML
    'FEED_FORMAT': 'csv',     # csv, json, xml
    'FEED_URI': 'output.csv',

    # download files to `FILES_STORE/full`
    # it needs `yield {'file_urls': [url]}` in `parse_det()`
    'ITEM_PIPELINES': {'scrapy.pipelines.files.FilesPipeline': 1},
    'FILES_STORE': '.',
})

c.crawl(MySpider)
c.start()