Scrapy Files Pipeline not downloading files


Question

I have been tasked with building a web crawler that downloads all .pdfs in a given site. The spider runs on my local machine and on Scrapinghub. For some reason, when I run it, it only downloads some, but not all, of the PDFs. This can be seen by looking at the items in the output JSON.

I have set MEDIA_ALLOW_REDIRECTS = True and tried running it on Scrapinghub as well as locally.

Here is my spider:

import scrapy
from scrapy.loader import ItemLoader
from poc_scrapy.items import file_list_Item
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class PdfCrawler(CrawlSpider):
    name = 'example'
    allowed_domains = ['www.groton.org']
    start_urls = ['https://www.groton.org']

    # URLs of PDFs that have already been sent to the pipeline
    downloaded_set = set()

    rules = (
        Rule(LinkExtractor(allow='www.groton.org'), callback='parse_page', follow=True),
    )

    def parse_page(self, response):
        pdf_urls = []
        link_urls = []
        other_urls = []

        # collect every href on the page
        all_href = response.xpath('/html/body//a/@href').extract()

        # classify all links
        for href in all_href:
            if len(href) < 1:
                continue
            if href[-4:] == '.pdf':
                pdf_urls.append(href)
            elif href[0] == '/':
                link_urls.append(href)
            else:
                other_urls.append(href)

        # send each PDF link to the FilesPipeline exactly once
        for pdf in pdf_urls:
            # turn relative links into absolute URLs
            if not pdf.startswith('http'):
                pdf = response.urljoin(pdf)

            if pdf in self.downloaded_set:
                # we have seen it before, skip it
                continue

            self.downloaded_set.add(pdf)
            loader = ItemLoader(item=file_list_Item())
            loader.add_value('file_urls', pdf)
            loader.add_value('base_url', response.url)
            yield loader.load_item()
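
The item class file_list_Item is imported from poc_scrapy.items but its definition is not shown in the question. For the FilesPipeline to pick an item up it needs at least a file_urls field (input) and a files field (output), plus the base_url field the spider fills in. A minimal sketch of what it presumably looks like (this exact definition is an assumption, not the asker's actual code):

# poc_scrapy/items.py -- assumed definition, not shown in the original question
import scrapy

class file_list_Item(scrapy.Item):
    file_urls = scrapy.Field()   # input: URLs the FilesPipeline should fetch
    files = scrapy.Field()       # output: filled in by the FilesPipeline after download
    base_url = scrapy.Field()    # page on which the PDF link was found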

settings.py

MEDIA_ALLOW_REDIRECTS = True
BOT_NAME = 'poc_scrapy'

SPIDER_MODULES = ['poc_scrapy.spiders']
NEWSPIDER_MODULE = 'poc_scrapy.spiders'

ROBOTSTXT_OBEY = True


DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'poc_scrapy.middlewares.UserAgentMiddlewareRotator': 400,
}


ITEM_PIPELINES = {
    'scrapy.pipelines.files.FilesPipeline': 1,
}
FILES_STORE = 'pdfs/'

AUTOTHROTTLE_ENABLED = True

Here is a small part of the output:

    {
        "file_urls": [
            "https://www.groton.org/ftpimages/542/download/download_3402393.pdf"
        ],
        "base_url": [
            "https://www.groton.org/parents/business-office"
        ],
        "files": []
    },

As you can see, the PDF file is listed in file_urls but was not downloaded. There are 5 warning messages indicating that some of the files could not be downloaded, but over 20 files are missing.

Here is the warning message I get for some of the files:

[scrapy.pipelines.files] File (code: 301): Error downloading file from <GET http://groton.myschoolapp.com/ftpimages/542/download/Candidate_Statement_2013.pdf> referred in <None>

[scrapy.core.downloader.handlers.http11] Received more bytes than download warn size (33554432) in request <GET https://groton.myschoolapp.com/ftpimages/542/download/download_1474034.pdf>
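
The second message is about Scrapy's response-size limits rather than robots.txt: DOWNLOAD_WARNSIZE (32 MB by default) only logs that warning, while DOWNLOAD_MAXSIZE is the limit that actually cancels a download. If some of the PDFs really are that large, a minimal sketch of raising both limits in settings.py (the 128 MB / 512 MB values below are arbitrary examples, not values from the question):

# settings.py -- raise the response-size limits for large PDFs
DOWNLOAD_WARNSIZE = 128 * 1024 * 1024   # warn only above 128 MB (default is 32 MB)
DOWNLOAD_MAXSIZE = 512 * 1024 * 1024    # abort only above 512 MB (default is 1 GB)

Both limits can also be set per spider (download_warnsize / download_maxsize attributes) or per request via Request.meta.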

I would expect either that all the files would be downloaded, or at least a warning message for every file that was not downloaded. Maybe there is a workaround.

Any feedback is greatly appreciated. Thanks!

Answer

UPDATE: I realized that the problem was that robots.txt was not allowing me to visit some of the PDFs. This could be fixed by using another service to download them, or by not following robots.txt.
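
If not honouring robots.txt is acceptable for this crawl, a minimal sketch of the second option is to turn off Scrapy's robots.txt handling, either project-wide in settings.py or only for this spider via custom_settings (ROBOTSTXT_OBEY and custom_settings are standard Scrapy features; where to put the override is a design choice):

# settings.py -- project-wide: stop honouring robots.txt
ROBOTSTXT_OBEY = False

# or, only for this spider, inside the spider class:
class PdfCrawler(CrawlSpider):
    name = 'example'
    custom_settings = {
        'ROBOTSTXT_OBEY': False,
    }

Note that with ROBOTSTXT_OBEY = True, requests rejected by robots.txt are typically only logged at DEBUG level ("Forbidden by robots.txt"), which is why most of the missing PDFs never produced a warning in the first place.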
