Scrapy Files Pipeline not downloading files


Question

I have been tasked with building a web crawler that downloads all .pdfs in a given site. The spider runs on my local machine and on Scrapinghub. For some reason, when I run it, it only downloads some, but not all, of the PDFs. This can be seen by looking at the items in the output JSON.

I have set MEDIA_ALLOW_REDIRECTS = True and tried running it on Scrapinghub as well as locally.

Here is my spider:

import scrapy
from scrapy.loader import ItemLoader
from poc_scrapy.items import file_list_Item
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class PdfCrawler(CrawlSpider):
    name = 'example'
    allowed_domains = ['www.groton.org']
    start_urls = ['https://www.groton.org']

    # URLs of PDFs that have already been sent to the pipeline
    downloaded_set = set()

    rules = (
        Rule(LinkExtractor(allow='www.groton.org'), callback='parse_page', follow=True),
    )

    def parse_page(self, response):
        pdf_urls = []
        link_urls = []
        other_urls = []

        # collect every href on the page
        all_href = response.xpath('/html/body//a/@href').extract()

        # classify all links
        for href in all_href:
            if len(href) < 1:
                continue
            if href[-4:] == '.pdf':
                pdf_urls.append(href)
            elif href[0] == '/':
                link_urls.append(href)
            else:
                other_urls.append(href)

        # send each PDF link to the FilesPipeline exactly once
        for pdf in pdf_urls:
            # turn relative links into absolute URLs
            if not pdf.startswith('http'):
                pdf = response.urljoin(pdf)

            if pdf in self.downloaded_set:
                # we have seen it before, skip it
                continue

            self.downloaded_set.add(pdf)
            loader = ItemLoader(item=file_list_Item())
            loader.add_value('file_urls', pdf)
            loader.add_value('base_url', response.url)
            yield loader.load_item()
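
The item class file_list_Item is imported from poc_scrapy.items but its definition is not shown in the question. For the FilesPipeline to pick an item up it needs at least a file_urls field (input) and a files field (output), plus the base_url field the spider fills in. A minimal sketch of what it presumably looks like (this exact definition is an assumption, not the asker's actual code):

# poc_scrapy/items.py -- assumed definition, not shown in the original question
import scrapy

class file_list_Item(scrapy.Item):
    file_urls = scrapy.Field()   # input: URLs the FilesPipeline should fetch
    files = scrapy.Field()       # output: filled in by the FilesPipeline after download
    base_url = scrapy.Field()    # page on which the PDF link was found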

settings.py

MEDIA_ALLOW_REDIRECTS = True
BOT_NAME = 'poc_scrapy'

SPIDER_MODULES = ['poc_scrapy.spiders']
NEWSPIDER_MODULE = 'poc_scrapy.spiders'

ROBOTSTXT_OBEY = True


DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'poc_scrapy.middlewares.UserAgentMiddlewareRotator': 400,
}


ITEM_PIPELINES = {
    'scrapy.pipelines.files.FilesPipeline': 1,
}
FILES_STORE = 'pdfs/'

AUTOTHROTTLE_ENABLED = True

Here is a small part of the output:

    {
        "file_urls": [
            "https://www.groton.org/ftpimages/542/download/download_3402393.pdf"
        ],
        "base_url": [
            "https://www.groton.org/parents/business-office"
        ],
        "files": []
    },

As you can see, the PDF file is listed in file_urls but was not downloaded. There are 5 warning messages indicating that some of the files could not be downloaded, but over 20 files are missing.

Here is the warning message I get for some of the files:

[scrapy.pipelines.files] File (code: 301): Error downloading file from <GET http://groton.myschoolapp.com/ftpimages/542/download/Candidate_Statement_2013.pdf> referred in <None>

[scrapy.core.downloader.handlers.http11] Received more bytes than download warn size (33554432) in request <GET https://groton.myschoolapp.com/ftpimages/542/download/download_1474034.pdf>
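
The second message is about Scrapy's response-size limits rather than robots.txt: DOWNLOAD_WARNSIZE (32 MB by default) only logs that warning, while DOWNLOAD_MAXSIZE is the limit that actually cancels a download. If some of the PDFs really are that large, a minimal sketch of raising both limits in settings.py (the 128 MB / 512 MB values below are arbitrary examples, not values from the question):

# settings.py -- raise the response-size limits for large PDFs
DOWNLOAD_WARNSIZE = 128 * 1024 * 1024   # warn only above 128 MB (default is 32 MB)
DOWNLOAD_MAXSIZE = 512 * 1024 * 1024    # abort only above 512 MB (default is 1 GB)

Both limits can also be set per spider (download_warnsize / download_maxsize attributes) or per request via Request.meta.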

I would expect either that all the files would be downloaded, or at least a warning message for every file that was not downloaded. Maybe there is a workaround.

Any feedback is greatly appreciated. Thanks!

Answer

UPDATE: I realized that the problem was that robots.txt was not allowing me to visit some of the PDFs. This could be fixed by using another service to download them, or by not following robots.txt.
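
If not honouring robots.txt is acceptable for this crawl, a minimal sketch of the second option is to turn off Scrapy's robots.txt handling, either project-wide in settings.py or only for this spider via custom_settings (ROBOTSTXT_OBEY and custom_settings are standard Scrapy features; where to put the override is a design choice):

# settings.py -- project-wide: stop honouring robots.txt
ROBOTSTXT_OBEY = False

# or, only for this spider, inside the spider class:
class PdfCrawler(CrawlSpider):
    name = 'example'
    custom_settings = {
        'ROBOTSTXT_OBEY': False,
    }

Note that with ROBOTSTXT_OBEY = True, requests rejected by robots.txt are typically only logged at DEBUG level ("Forbidden by robots.txt"), which is why most of the missing PDFs never produced a warning in the first place.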
