Having issue while downloading all pdf files on .asp website using Scrapy

Question

I am having an issue while downloading multiple PDF files from a .asp website using Scrapy. This is the URL of the website: https://ceo.maharashtra.gov.in/searchlist/SearchRollPDF.aspx.

Now, if you go through the website, you'll see that it sends multiple form requests to the same URL above and generates freshly updated HTML content for the same page. I have gone through every step, including solving the CAPTCHA, and have finally arrived at the last step, where the PDFs can be downloaded.

When you fill in all the form details, including the CAPTCHA, you will see multiple links for downloading the same number of unique PDF files. This is where I am having the issue.

Now, when you click on any of the links, it sends a POST request to the same URL above and refreshes the page with the following JavaScript content.

<script type="text/javascript">
//<![CDATA[
window.open('ViewRoll.aspx');//]]>
</script>

This code opens another tab with the URL https://ceo.maharashtra.gov.in/searchlist/ViewRoll.aspx, which displays the PDF in that tab. I want to download this PDF file.

So far, I am able to download a single PDF file with no issues using Scrapy. But the issue I have is downloading more than one PDF file. Sometimes my code below downloads the same PDF file twice, and sometimes it downloads only one PDF file. But every time, it downloads at least one PDF file, if not every other PDF file.

# -*- coding: utf-8 -*-
import scrapy
import cv2
import pytesseract
from io import BytesIO
from PIL import Image  # needed by store_image() below; this import was missing
from election_data.items import ElectionDataItem
import os
from pathlib import Path

class ElectionSpider(scrapy.Spider):
    name = 'election'
    allowed_domains = ['ceo.maharashtra.gov.in']
    start_urls = ['https://ceo.maharashtra.gov.in/searchlist/SearchRollPDF.aspx']
    base_path = "D:\\Projects\\scrape_data\\data"

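    # Step 1: pick the first district option and trigger the district postback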
    def parse(self, response):
        district = response.css('select#Content_DistrictList > option::attr(value)')[1].extract()
        district_name = response.css('select#Content_DistrictList > option::text')[1].extract()
        district_path = os.path.join(self.base_path, district_name.replace(' ', '_'))
        os.mkdir(district_path)
        data = {
            '__EVENTTARGET' : response.css('select#Content_DistrictList::attr(name)').extract_first(),
            '__EVENTARGUMENT' : '',
            '__LASTFOCUS' : '', 
            '__VIEWSTATE' : response.css('input#__VIEWSTATE::attr(value)').extract_first(),
            '__EVENTVALIDATION' : response.css('input#__EVENTVALIDATION::attr(value)').extract_first(),
            'ctl00$Content$DistrictList' : district,
            'ctl00$Content$txtcaptcha' : ''
        }
        meta = {'handle_httpstatus_all': True}
        request = scrapy.FormRequest(url=self.start_urls[0], method='POST', formdata=data, meta=meta, callback=self.parse_assembly)
        request.meta['district'] = district
        request.meta['district_path'] = district_path
        yield request

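    # Step 2: pick the first assembly for the chosen district and post back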
    def parse_assembly(self, response):
        print('parse_assembly')
        assembly = response.css('select#Content_AssemblyList > option::attr(value)')[1].extract()
        assembly_name = response.css('select#Content_AssemblyList > option::text')[1].extract()
        assembly_path = os.path.join(response.meta['district_path'], assembly_name.replace(' ', '_'))
        os.mkdir(assembly_path)
        data = {
            '__EVENTTARGET' : response.css('select#Content_AssemblyList::attr(name)').extract_first(),
            '__EVENTARGUMENT' : '',
            '__LASTFOCUS' : '', 
            '__VIEWSTATE' : response.css('input#__VIEWSTATE::attr(value)').extract_first(),
            '__EVENTVALIDATION' : response.css('input#__EVENTVALIDATION::attr(value)').extract_first(),
            'ctl00$Content$DistrictList' : response.meta['district'],
            'ctl00$Content$AssemblyList' : assembly,
            'ctl00$Content$txtcaptcha' : ''
        }
        meta = {'handle_httpstatus_all': True}
        request = scrapy.FormRequest(url=self.start_urls[0], method='POST', formdata=data, meta=meta, callback=self.parse_part)
        request.meta['district'] = response.meta['district']
        request.meta['assembly'] = assembly
        request.meta['assembly_path'] = assembly_path
        yield request

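    # Step 3: pick the first part for the chosen assembly; stash the hidden fields for later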
    def parse_part(self, response):
        print('parse_part')
        part = response.css('select#Content_PartList > option::attr(value)')[1].extract()
        part_name = response.css('select#Content_PartList > option::text')[1].extract()
        part_path = os.path.join(response.meta['assembly_path'], part_name.replace(' ', '_'))
        os.mkdir(part_path)
        data = {
            '__EVENTTARGET' : response.css('select#Content_PartList::attr(name)').extract_first(),
            '__EVENTARGUMENT' : '',
            '__LASTFOCUS' : '', 
            '__VIEWSTATE' : response.css('input#__VIEWSTATE::attr(value)').extract_first(),
            '__EVENTVALIDATION' : response.css('input#__EVENTVALIDATION::attr(value)').extract_first(),
            'ctl00$Content$DistrictList' : response.meta['district'],
            'ctl00$Content$AssemblyList' : response.meta['assembly'],
            'ctl00$Content$PartList' : part,
            'ctl00$Content$txtcaptcha' : ''
        }
        meta = {'handle_httpstatus_all': True}
        request = scrapy.FormRequest(url=self.start_urls[0], method='POST', formdata=data, meta=meta, callback=self.parse_captcha)
        request.meta['__VIEWSTATE'] = response.css('input#__VIEWSTATE::attr(value)').extract_first()
        request.meta['__EVENTVALIDATION'] = response.css('input#__EVENTVALIDATION::attr(value)').extract_first()
        request.meta['district'] = response.meta['district']
        request.meta['assembly'] = response.meta['assembly']
        request.meta['part'] = part
        request.meta['part_path'] = part_path
        yield request

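    # Step 4: fetch the CAPTCHA image, keeping this form response for the final submit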
    def parse_captcha(self, response):
        data_for_later = response
        request = scrapy.Request(url='https://ceo.maharashtra.gov.in/searchlist/Captcha.aspx', callback=self.store_image)
        request.meta['__VIEWSTATE'] = response.css('input#__VIEWSTATE::attr(value)').extract_first()
        request.meta['__EVENTVALIDATION'] = response.css('input#__EVENTVALIDATION::attr(value)').extract_first()
        request.meta['district'] = response.meta['district']
        request.meta['assembly'] = response.meta['assembly']
        request.meta['part'] = response.meta['part']
        request.meta['part_path'] = response.meta['part_path']
        request.meta['data_for_later'] = data_for_later
        yield request

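    # Step 5: OCR the CAPTCHA and submit the completed form via the saved response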
    def store_image(self, response):
        captcha_target_filename = 'filename.png'
        # save the image for processing
        i = Image.open(BytesIO(response.body))
        i.save(captcha_target_filename)
        captcha_text = self.solve_captcha(captcha_target_filename)
        print(captcha_text)
        data = {
            '__EVENTTARGET' : '',
            '__EVENTARGUMENT' : '',
            '__LASTFOCUS' : '', 
            '__VIEWSTATE' : response.meta['__VIEWSTATE'],
            '__EVENTVALIDATION' : response.meta['__EVENTVALIDATION'],
            'ctl00$Content$DistrictList' : response.meta['district'],
            'ctl00$Content$AssemblyList' : response.meta['assembly'],
            'ctl00$Content$PartList' : response.meta['part'],
            'ctl00$Content$txtcaptcha' : captcha_text,
            'ctl00$Content$OpenButton': 'Open PDF'
        }
        captcha_form = response.meta['data_for_later']
        meta = {'handle_httpstatus_all': True}
        request = scrapy.FormRequest.from_response(captcha_form, method='POST', formdata=data, meta=meta, callback=self.get_pdf_list)
        request.meta['district'] = response.meta['district']
        request.meta['assembly'] = response.meta['assembly']
        request.meta['part'] = response.meta['part']
        request.meta['part_path'] = response.meta['part_path']
        request.meta['data_for_later'] = captcha_form
        yield request

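    # Step 6: emulate each result link's __doPostBack() with one POST per link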
    def get_pdf_list(self, response):
        print('get_pdf_list')
        data_for_later = response
        pdf_content = response.meta['data_for_later']
        meta = {'handle_httpstatus_all': True}
        for th, td in zip(response.css('table#Content_gvRollPDF > tr > th'), response.css('table#Content_gvRollPDF tr > td')):
            data = {
                '__EVENTTARGET' : td.css('a::attr(href)').extract_first().split("'")[1],
                '__VIEWSTATE' : response.css('input#__VIEWSTATE::attr(value)').extract_first(),
                '__EVENTVALIDATION' : response.css('input#__EVENTVALIDATION::attr(value)').extract_first(),
                'ctl00$Content$DistrictList' : response.meta['district'],
                'ctl00$Content$AssemblyList': response.meta['assembly'],
                'ctl00$Content$PartList': response.meta['part']
            }
            print(td.css('a::attr(href)').extract_first().split("'")[1])
            request = scrapy.FormRequest(url=self.start_urls[0], method='POST', formdata=data, meta=meta, callback=self.download_pdf)
            request.meta['pdf_name'] = th.css('::text').extract_first()
            request.meta['part_path'] = response.meta['part_path']
            yield request

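    # Step 7: after a link's postback, ViewRoll.aspx serves the PDF selected by this session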
    def download_pdf(self, response):
        print('download_pdf')
        request = scrapy.Request(url='https://ceo.maharashtra.gov.in/searchlist/ViewRoll.aspx', callback=self.pdf_data, dont_filter=True)
        request.meta['pdf_name'] = response.meta['pdf_name']
        request.meta['part_path'] = response.meta['part_path']
        yield request

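    # Step 8: write the fetched PDF bytes to disk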
    def pdf_data(self, response):
        path = os.path.join(response.meta['part_path'], response.meta['pdf_name'].replace(' ', '_') + '.pdf')
        filename = Path(path)
        filename.write_bytes(response.body)
        print(path)

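    # Clean the CAPTCHA image (threshold + morphological close, then invert) before OCR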
    def solve_captcha(self, image):
        image = cv2.imread(image,0)
        thresh = cv2.threshold(image, 220, 255, cv2.THRESH_BINARY)[1]

        kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (3,3))
        close = cv2.morphologyEx(thresh, cv2.MORPH_CLOSE, kernel)

        result = 255 - close
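        # Debug previews; these display nothing without a cv2.waitKey() call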
        cv2.imshow('thresh', thresh)
        cv2.imshow('close', close)
        cv2.imshow('result', result)

        return pytesseract.image_to_string(result)

Kindly find the Scrapy log below:

(base) D:\Projects\GitHub\election_data>scrapy runspider election_data\spiders\election.py
2019-09-15 02:28:36 [scrapy.utils.log] INFO: Scrapy 1.6.0 started (bot: election_data)
2019-09-15 02:28:36 [scrapy.utils.log] INFO: Versions: lxml 4.3.4.0, libxml2 2.9.9, cssselect 1.0.3, parsel 1.5.1, w3lib 1.20.0, Twisted 19.2.0, Python 3.7.3 (default, Apr 24 2019, 15:29:51) [MSC v.1915 64 bit (AMD64)], pyOpenSSL 19.0.0 (OpenSSL 1.1.1c  28 May 2019), cryptography 2.7, Platform Windows-10-10.0.17763-SP0
2019-09-15 02:28:36 [scrapy.crawler] INFO: Overridden settings: {'AUTOTHROTTLE_DEBUG': True, 'AUTOTHROTTLE_ENABLED': True, 'BOT_NAME': 'election_data', 'DOWNLOAD_DELAY': 3, 'NEWSPIDER_MODULE': 'election_data.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_LOADER_WARN_ONLY': True, 'SPIDER_MODULES': ['election_data.spiders'], 'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36'}
2019-09-15 02:28:36 [scrapy.extensions.telnet] INFO: Telnet Password: 705359b7d6b3b682
2019-09-15 02:28:36 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.throttle.AutoThrottle']
2019-09-15 02:28:36 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-09-15 02:28:36 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-09-15 02:28:36 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-09-15 02:28:36 [scrapy.core.engine] INFO: Spider opened
2019-09-15 02:28:36 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-09-15 02:28:36 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2019-09-15 02:28:36 [scrapy.extensions.throttle] INFO: slot: ceo.maharashtra.gov.in | conc: 1 | delay: 5000 ms (+0) | latency:   82 ms | size:  1245 bytes
2019-09-15 02:28:36 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://ceo.maharashtra.gov.in/robots.txt> (referer: None)
2019-09-15 02:28:42 [scrapy.extensions.throttle] INFO: slot: ceo.maharashtra.gov.in | conc: 1 | delay: 3000 ms (-2000) | latency:   49 ms | size:  3961 bytes
2019-09-15 02:28:42 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://ceo.maharashtra.gov.in/searchlist/SearchRollPDF.aspx> (referer: None)
2019-09-15 02:28:47 [scrapy.extensions.throttle] INFO: slot: ceo.maharashtra.gov.in | conc: 1 | delay: 3000 ms (+0) | latency:   88 ms | size:  4877 bytes
2019-09-15 02:28:47 [scrapy.core.engine] DEBUG: Crawled (200) <POST https://ceo.maharashtra.gov.in/searchlist/SearchRollPDF.aspx> (referer: https://ceo.maharashtra.gov.in/searchlist/SearchRollPDF.aspx)
parse_assembly
2019-09-15 02:28:50 [scrapy.extensions.throttle] INFO: slot: ceo.maharashtra.gov.in | conc: 1 | delay: 3000 ms (+0) | latency:  116 ms | size: 20054 bytes
2019-09-15 02:28:50 [scrapy.core.engine] DEBUG: Crawled (200) <POST https://ceo.maharashtra.gov.in/searchlist/SearchRollPDF.aspx> (referer: https://ceo.maharashtra.gov.in/searchlist/SearchRollPDF.aspx)
parse_part
2019-09-15 02:28:55 [scrapy.extensions.throttle] INFO: slot: ceo.maharashtra.gov.in | conc: 1 | delay: 3000 ms (+0) | latency:  439 ms | size: 20050 bytes
2019-09-15 02:28:55 [scrapy.core.engine] DEBUG: Crawled (200) <POST https://ceo.maharashtra.gov.in/searchlist/SearchRollPDF.aspx> (referer: https://ceo.maharashtra.gov.in/searchlist/SearchRollPDF.aspx)
2019-09-15 02:28:59 [scrapy.extensions.throttle] INFO: slot: ceo.maharashtra.gov.in | conc: 1 | delay: 3000 ms (+0) | latency:   43 ms | size:  3965 bytes
2019-09-15 02:28:59 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://ceo.maharashtra.gov.in/searchlist/Captcha.aspx> (referer: https://ceo.maharashtra.gov.in/searchlist/SearchRollPDF.aspx)
cDDmt8
2019-09-15 02:29:04 [scrapy.extensions.throttle] INFO: slot: ceo.maharashtra.gov.in | conc: 1 | delay: 3000 ms (+0) | latency:  824 ms | size: 20576 bytes
2019-09-15 02:29:04 [scrapy.core.engine] DEBUG: Crawled (200) <POST https://ceo.maharashtra.gov.in/searchlist/SearchRollPDF.aspx> (referer: https://ceo.maharashtra.gov.in/searchlist/Captcha.aspx)
get_pdf_list
ctl00$Content$gvRollPDF$ctl02$MRollLink
ctl00$Content$gvRollPDF$ctl02$SupplementsLink
ctl00$Content$gvRollPDF$ctl02$SupplementsTwoLink
2019-09-15 02:29:07 [scrapy.extensions.throttle] INFO: slot: ceo.maharashtra.gov.in | conc: 1 | delay: 3000 ms (+0) | latency:  178 ms | size: 20639 bytes
2019-09-15 02:29:07 [scrapy.core.engine] DEBUG: Crawled (200) <POST https://ceo.maharashtra.gov.in/searchlist/SearchRollPDF.aspx> (referer: https://ceo.maharashtra.gov.in/searchlist/SearchRollPDF.aspx)
download_pdf
2019-09-15 02:29:10 [scrapy.extensions.throttle] INFO: slot: ceo.maharashtra.gov.in | conc: 1 | delay: 3000 ms (+0) | latency:   83 ms | size: 20639 bytes
2019-09-15 02:29:10 [scrapy.core.engine] DEBUG: Crawled (200) <POST https://ceo.maharashtra.gov.in/searchlist/SearchRollPDF.aspx> (referer: https://ceo.maharashtra.gov.in/searchlist/SearchRollPDF.aspx)
download_pdf
2019-09-15 02:29:13 [scrapy.extensions.throttle] INFO: slot: ceo.maharashtra.gov.in | conc: 1 | delay: 3000 ms (+0) | latency:   84 ms | size: 20639 bytes
2019-09-15 02:29:13 [scrapy.core.engine] DEBUG: Crawled (200) <POST https://ceo.maharashtra.gov.in/searchlist/SearchRollPDF.aspx> (referer: https://ceo.maharashtra.gov.in/searchlist/SearchRollPDF.aspx)
download_pdf
2019-09-15 02:29:18 [scrapy.extensions.throttle] INFO: slot: ceo.maharashtra.gov.in | conc: 1 | delay: 3000 ms (+0) | latency:  569 ms | size:155714 bytes
2019-09-15 02:29:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://ceo.maharashtra.gov.in/searchlist/ViewRoll.aspx> (referer: https://ceo.maharashtra.gov.in/searchlist/SearchRollPDF.aspx)
D:\Projects\GitHub\election_data\data\Ahmednagar\216_-_Akole_(ST)\1_-_Pachpathawadi\Mother_Roll.pdf
2019-09-15 02:29:22 [scrapy.extensions.throttle] INFO: slot: ceo.maharashtra.gov.in | conc: 1 | delay: 3000 ms (+0) | latency:  462 ms | size:155714 bytes
2019-09-15 02:29:22 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://ceo.maharashtra.gov.in/searchlist/ViewRoll.aspx> (referer: https://ceo.maharashtra.gov.in/searchlist/SearchRollPDF.aspx)
D:\Projects\GitHub\election_data\data\Ahmednagar\216_-_Akole_(ST)\1_-_Pachpathawadi\supplementary_2.pdf
2019-09-15 02:29:25 [scrapy.extensions.throttle] INFO: slot: ceo.maharashtra.gov.in | conc: 1 | delay: 3000 ms (+0) | latency:  454 ms | size:155714 bytes
2019-09-15 02:29:25 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://ceo.maharashtra.gov.in/searchlist/ViewRoll.aspx> (referer: https://ceo.maharashtra.gov.in/searchlist/SearchRollPDF.aspx)
D:\Projects\GitHub\election_data\data\Ahmednagar\216_-_Akole_(ST)\1_-_Pachpathawadi\supplementary_1.pdf
2019-09-15 02:29:25 [scrapy.core.engine] INFO: Closing spider (finished)
2019-09-15 02:29:25 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 103807,
 'downloader/request_count': 13,
 'downloader/request_method_count/GET': 6,
 'downloader/request_method_count/POST': 7,
 'downloader/response_bytes': 607088,
 'downloader/response_count': 13,
 'downloader/response_status_count/200': 12,
 'downloader/response_status_count/404': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2019, 9, 14, 20, 59, 25, 458688),
 'log_count/DEBUG': 13,
 'log_count/INFO': 22,
 'request_depth_max': 7,
 'response_received_count': 13,
 'robotstxt/request_count': 1,
 'robotstxt/response_count': 1,
 'robotstxt/response_status_count/404': 1,
 'scheduler/dequeued': 12,
 'scheduler/dequeued/memory': 12,
 'scheduler/enqueued': 12,
 'scheduler/enqueued/memory': 12,
 'start_time': datetime.datetime(2019, 9, 14, 20, 58, 36, 817768)}
2019-09-15 02:29:25 [scrapy.core.engine] INFO: Spider closed (finished)

Kindly help me in solving this problem.

Answer

It's very likely that every request for downloading a PDF changes the ASP session state. So, in order to download all of the PDFs, you need to do the downloading sequentially:

  1. Create the request to download PDF 1.
  2. Make sure you update the cookies, etc. that come with the PDF download.
  3. Create the request to download PDF 2, and so on (see the sketch below).
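
A minimal sketch of one way to serialize the downloads, assuming the spider structure from the question: get_pdf_list collects the pending __EVENTTARGET values instead of firing one POST per link immediately, and each subsequent postback is issued only after the previous PDF has been written to disk. The helper next_pdf_request and the meta keys pending, pdf_name, and page are illustrative names, not Scrapy API; the methods below are meant as stand-ins for the corresponding spider methods.

def get_pdf_list(self, response):
    # Collect every link's postback target instead of POSTing them all at once.
    pending = []
    for th, td in zip(response.css('table#Content_gvRollPDF > tr > th'), response.css('table#Content_gvRollPDF tr > td')):
        pending.append((td.css('a::attr(href)').extract_first().split("'")[1],
                        th.css('::text').extract_first()))
    if pending:
        yield self.next_pdf_request(response, pending)

def next_pdf_request(self, page, pending):
    # 'page' must be a SearchRollPDF.aspx response: its hidden fields carry the
    # fresh __VIEWSTATE/__EVENTVALIDATION needed for the next postback.
    target, pdf_name = pending.pop(0)
    data = {
        '__EVENTTARGET': target,
        '__VIEWSTATE': page.css('input#__VIEWSTATE::attr(value)').extract_first(),
        '__EVENTVALIDATION': page.css('input#__EVENTVALIDATION::attr(value)').extract_first(),
        'ctl00$Content$DistrictList': page.meta['district'],
        'ctl00$Content$AssemblyList': page.meta['assembly'],
        'ctl00$Content$PartList': page.meta['part'],
    }
    meta = dict(page.meta, pending=pending, pdf_name=pdf_name)
    return scrapy.FormRequest(url=self.start_urls[0], formdata=data, meta=meta,
                              callback=self.download_pdf, dont_filter=True)

def download_pdf(self, response):
    # The postback selected one PDF for this session; fetch it, and keep this
    # page in meta so the next postback can reuse its hidden fields.
    meta = dict(response.meta, page=response)
    yield scrapy.Request(url='https://ceo.maharashtra.gov.in/searchlist/ViewRoll.aspx',
                         callback=self.pdf_data, meta=meta, dont_filter=True)

def pdf_data(self, response):
    path = os.path.join(response.meta['part_path'],
                        response.meta['pdf_name'].replace(' ', '_') + '.pdf')
    Path(path).write_bytes(response.body)
    if response.meta['pending']:
        # Only after this PDF is on disk do we trigger the next link's postback.
        yield self.next_pdf_request(response.meta['page'], response.meta['pending'])

Because ViewRoll.aspx serves whichever PDF the session last selected, serializing the postbacks this way prevents one download from clobbering another's selection, and Scrapy's CookiesMiddleware keeps the session cookies up to date between steps (point 2 above).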
