Scrapy Files Pipeline not downloading files
Question
I have been tasked with building a web crawler that downloads all the .pdf files on a given site. The spider runs on my local machine and on Scrapinghub. For some reason, when I run it, it downloads only some, but not all, of the PDFs. This can be seen by looking at the items in the output JSON.
I have set MEDIA_ALLOW_REDIRECTS = True and tried running the spider on Scrapinghub as well as locally.
Here is my spider:
import scrapy
from scrapy.loader import ItemLoader
from poc_scrapy.items import file_list_Item
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class PdfCrawler(CrawlSpider):
    # loader = ItemLoader(item=file_list_Item())
    downloaded_set = {''}
    name = 'example'
    allowed_domains = ['www.groton.org']
    start_urls = ['https://www.groton.org']

    rules = (
        Rule(LinkExtractor(allow='www.groton.org'), callback='parse_page', follow=True),
    )

    def parse_page(self, response):
        print('parsing', response)
        pdf_urls = []
        link_urls = []
        other_urls = []
        # print("this is the response", response.text)
        all_href = response.xpath('/html/body//a/@href').extract()

        # classify all links
        for href in all_href:
            if len(href) < 1:
                continue
            if href[-4:] == '.pdf':
                pdf_urls.append(href)
            elif href[0] == '/':
                link_urls.append(href)
            else:
                other_urls.append(href)

        # get the links that have pdfs and send them to the item pipeline
        for pdf in pdf_urls:
            if pdf[0:5] != 'http':
                new_pdf = response.urljoin(pdf)
                if new_pdf in self.downloaded_set:
                    # we have seen it before, don't do anything
                    # print('skipping ', new_pdf)
                    pass
                else:
                    loader = ItemLoader(item=file_list_Item())
                    # print(self.downloaded_set)
                    self.downloaded_set.add(new_pdf)
                    loader.add_value('file_urls', new_pdf)
                    loader.add_value('base_url', response.url)
                    yield loader.load_item()
            else:
                if pdf in self.downloaded_set:
                    pass
                else:
                    loader = ItemLoader(item=file_list_Item())
                    self.downloaded_set.add(pdf)
                    loader.add_value('file_urls', pdf)
                    loader.add_value('base_url', response.url)
                    yield loader.load_item()
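The poc_scrapy/items.py module is not shown in the question; a minimal sketch of what it presumably contains, assuming only the fields the spider and the output JSON use (file_urls and files are the field names FilesPipeline expects, base_url is the spider's custom field):

import scrapy

class file_list_Item(scrapy.Item):
    # fields consumed by scrapy.pipelines.files.FilesPipeline
    file_urls = scrapy.Field()   # URLs to download
    files = scrapy.Field()       # filled in by the pipeline with download results
    # custom field recording the page the link was found on
    base_url = scrapy.Field()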
settings.py
MEDIA_ALLOW_REDIRECTS = True
BOT_NAME = 'poc_scrapy'
SPIDER_MODULES = ['poc_scrapy.spiders']
NEWSPIDER_MODULE = 'poc_scrapy.spiders'
ROBOTSTXT_OBEY = True
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'poc_scrapy.middlewares.UserAgentMiddlewareRotator': 400,
}
ITEM_PIPELINES = {
    'scrapy.pipelines.files.FilesPipeline': 1,
}
FILES_STORE = 'pdfs/'
AUTOTHROTTLE_ENABLED = True
Here is a small part of the output:
{
    "file_urls": [
        "https://www.groton.org/ftpimages/542/download/download_3402393.pdf"
    ],
    "base_url": [
        "https://www.groton.org/parents/business-office"
    ],
    "files": []
},
As you can see, the PDF file is listed in file_urls but was not downloaded. There are 5 warning messages indicating that some of the files could not be downloaded, but there are over 20 missing files.
Here are the warning messages I get for some of the files:
[scrapy.pipelines.files] File (code: 301): Error downloading file from <GET http://groton.myschoolapp.com/ftpimages/542/download/Candidate_Statement_2013.pdf> referred in <None>
[scrapy.core.downloader.handlers.http11] Received more bytes than download warn size (33554432) in request <GET https://groton.myschoolapp.com/ftpimages/542/download/download_1474034.pdf>
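The second warning corresponds to Scrapy's default DOWNLOAD_WARNSIZE of 33554432 bytes (32 MB). If some of the missing PDFs are simply larger than the download size limits, raising those limits in settings.py might help; a sketch with example values, assuming (the logs do not confirm it) that size is what stops those particular files:

# settings.py
DOWNLOAD_WARNSIZE = 67108864      # warn at 64 MB instead of the default 32 MB
DOWNLOAD_MAXSIZE = 2147483648     # cancel only above 2 GB instead of the default 1 GB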
I would expect all the files to be downloaded, or at least a warning message for every file that is not downloaded. Maybe there is a workaround.
Any feedback is greatly appreciated. Thanks!
Answer
UPDATE: I realized that the problem was that robots.txt was not allowing me to visit some of the PDFs. This could be fixed by using another service to download them, or by not obeying robots.txt.
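If ignoring robots.txt is acceptable for the target site, one way to do it (not taken from the original answer) is to override the project-wide ROBOTSTXT_OBEY = True for this spider only via custom_settings; a minimal sketch:

from scrapy.spiders import CrawlSpider

class PdfCrawler(CrawlSpider):
    name = 'example'
    allowed_domains = ['www.groton.org']
    start_urls = ['https://www.groton.org']
    # override the project-wide ROBOTSTXT_OBEY = True for this spider only
    custom_settings = {
        'ROBOTSTXT_OBEY': False,
    }
    # rules and parse_page unchanged from the spider above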