Using Scrapy to find and download PDF files from a website

Views: 70  Published: 2021/6/25 20:29:41  Tags: python scrapy

Question

I've been tasked with pulling PDF files from websites using Scrapy. I'm not new to Python, but Scrapy is very new to me. I've been experimenting with the console and a few rudimentary spiders. I found and modified this code:

import urlparse
import scrapy

from scrapy.http import Request

class pwc_tax(scrapy.Spider):
    name = "pwc_tax"

    allowed_domains = ["www.pwc.com"]
    start_urls = ["http://www.pwc.com/us/en/tax-services/publications/research-and-insights.html"]

    def parse(self, response):
        base_url = "http://www.pwc.com/us/en/tax-services/publications/research-and-insights.html"
        for a in response.xpath('//a[@href]/@href'):
            link = a.extract()
            if link.endswith('.pdf'):
                link = urlparse.urljoin(base_url, link)
                yield Request(link, callback=self.save_pdf)

    def save_pdf(self, response):
        path = response.url.split('/')[-1]
        with open(path, 'wb') as f:
            f.write(response.body)

At the command line I run

scrapy crawl mySpider

and I get nothing back. I didn't create a Scrapy item because I only want to crawl and download the files, with no metadata. I would appreciate any help on this.

Recommended answer

The spider logic seems incorrect.

I had a quick look at your website, and it seems there are several types of pages:

1. http://www.pwc.com/us/en/tax-services/publications/research-and-insights.html, the initial page
2. Web pages for specific articles, e.g. http://www.pwc.com/us/en/tax-services/publications/insights/australia-introduces-new-foreign-resident-cgt-withholding-regime.html, which can be reached from page #1
3. Actual PDF locations, e.g. http://www.pwc.com/us/en/state-local-tax/newsletters/salt-insights/assets/pwc-wotc-precertification-period-extended-to-june-29.pdf, which can be reached from page #2

Thus the correct logic looks like this: get the #1 page first, then get the #2 pages, and then download the #3 pages.
However, your spider tries to extract links to #3 pages directly from the #1 page, where no such links exist.
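The filtering step can be tried in isolation to see why a listing page with no direct PDF links yields nothing. The following is only a sketch, written for Python 3 with `urllib.parse` (the successor of the Python 2 `urlparse` module used in the question). The helper name `pdf_links` and all of the href values are hypothetical, made up for illustration; they are not taken from the actual site.

```python
from urllib.parse import urljoin


def pdf_links(page_url, hrefs):
    """Return absolute URLs for the hrefs that end in .pdf."""
    return [urljoin(page_url, h) for h in hrefs if h.lower().endswith(".pdf")]


# Hypothetical hrefs, invented for illustration only.
listing_url = "http://www.pwc.com/us/en/tax-services/publications/research-and-insights.html"
listing_hrefs = ["/us/en/tax-services/publications/insights/example-article.html"]

article_url = "http://www.pwc.com/us/en/tax-services/publications/insights/example-article.html"
article_hrefs = ["assets/example-report.pdf", "/us/en/contact.html"]

# The listing page only links to articles, so nothing passes the filter.
print(pdf_links(listing_url, listing_hrefs))
# The article page is where a relative PDF href would be resolved.
print(pdf_links(article_url, article_hrefs))
```

If the real listing page behaves like the hypothetical one here, the `endswith('.pdf')` check never matches on page #1, so no download request is ever scheduled.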

I have updated your code, and here is a version that actually works:

import scrapy

from scrapy.http import Request

class pwc_tax(scrapy.Spider):
    name = "pwc_tax"

    allowed_domains = ["www.pwc.com"]
    start_urls = ["http://www.pwc.com/us/en/tax-services/publications/research-and-insights.html"]

    def parse(self, response):
        # Page #1: follow each article link in the results list.
        for href in response.css('div#all_results h3 a::attr(href)').extract():
            yield Request(
                url=response.urljoin(href),
                callback=self.parse_article
            )

    def parse_article(self, response):
        # Page #2: request every PDF linked from the download section.
        for href in response.css('div.download_wrapper a[href$=".pdf"]::attr(href)').extract():
            yield Request(
                url=response.urljoin(href),
                callback=self.save_pdf
            )

    def save_pdf(self, response):
        # Page #3: save the PDF under the last segment of its URL.
        path = response.url.split('/')[-1]
        self.logger.info('Saving PDF %s', path)
        with open(path, 'wb') as f:
            f.write(response.body)
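One detail worth noting about `save_pdf`: taking the last `/`-separated segment of the URL works for the PDF links shown above, but it would keep any query string or fragment as part of the file name. A possible alternative, sketched here for Python 3 with the standard library only, is below. The helper name `pdf_filename` and the `default` fallback are assumptions introduced for illustration, not part of the original answer.

```python
import posixpath
from urllib.parse import unquote, urlsplit


def pdf_filename(url, default="download.pdf"):
    """Derive a local file name from a URL, ignoring query string and fragment."""
    path = urlsplit(url).path                  # drops ?query and #fragment
    name = posixpath.basename(unquote(path))   # last path segment, percent-decoded
    return name or default                     # hypothetical fallback for URLs ending in "/"


print(pdf_filename(
    "http://www.pwc.com/us/en/state-local-tax/newsletters/salt-insights/assets/"
    "pwc-wotc-precertification-period-extended-to-june-29.pdf"
))
```

Inside the spider, `path = pdf_filename(response.url)` could then stand in for the `split('/')` line, assuming the helper is defined in the same module.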
