Scrapy works fine until page 12 of asp site, then 500 error


Problem description



My first scraping project with Python/Scrapy. Site is http://pabigtrees.com/ with 78 pages and 20 items (trees) per page. This is the full spider with a few changes to provide a minimal demonstration (scraping only one value per page):

import scrapy
from pabigtrees.items import Tree

class TreesSpider(scrapy.Spider):
  name = "trees"
  start_urls = ["http://pabigtrees.com/view_tree.aspx"]
  allowed_domains = ["pabigtrees.com"]
  download_delay = 2

  def parse(self, response):
    for page in [1,11,12]:
    #for page in range(1,79):
      if page == 1:
        yield scrapy.FormRequest.from_response(
          response,
          #callback=self.parse_page
          callback=self.parse_test
        )
      else:
        yield scrapy.FormRequest.from_response(
          response,
          formdata={
            '__EVENTTARGET': 'ctl00$ContentPlaceHolder1$GridView1',
            '__EVENTARGUMENT': "Page$" + str(page),
            'ctl00$ContentPlaceHolder1$genus_latin': '0',
            'ctl00$ContentPlaceHolder1$genus_common': '0',
            'ctl00$ContentPlaceHolder1$county': '0',
            '__VIEWSTATE': response.css('input#__VIEWSTATE::attr(value)').extract_first(),
            '__VIEWSTATEGENERATOR': response.css('input#__VIEWSTATEGENERATOR::attr(value)').extract_first(),
            '__SCROLLPOSITIONX': response.css('input#__SCROLLPOSITIONX::attr(value)').extract_first(),
            '__SCROLLPOSITIONY': response.css('input#__SCROLLPOSITIONY::attr(value)').extract_first(),
            '__EVENTVALIDATION': response.css('input#__EVENTVALIDATION::attr(value)').extract_first()
          },
          #callback=self.parse_page
          callback=self.parse_test
        )

  def parse_test(self, response):
    yield {
      'county': response.xpath('//a[contains(@href,"Select$1")]/../../../td[5]/font/text()').extract_first()
    }

  def parse_page(self, response):
    for tree in range(0,20):

      yield scrapy.FormRequest.from_response(
        response,
        formdata={
          '__EVENTTARGET': 'ctl00$ContentPlaceHolder1$GridView1',
          '__EVENTARGUMENT': "Select$" + str(tree)
        },
        # save the county from the list page because it is not available on the detail page
        meta={'county': response.xpath('//a[contains(@href,"Select$' + str(tree) + '")]/../../../td[5]/font/text()').extract_first()},
        callback=self.parse_results
      )

  def parse_results(self, response):
    item = Tree()
    genus = response.css('span#ctl00_ContentPlaceHolder1_tree_genus::text').extract()
    species = response.css('span#ctl00_ContentPlaceHolder1_tree_species::text').extract()
    circumference = response.css('span#ctl00_ContentPlaceHolder1_lblcircum::text').extract()
    spread = response.css('span#ctl00_ContentPlaceHolder1_lblSpread::text').extract()
    height = response.css('span#ctl00_ContentPlaceHolder1_lblHeight::text').extract()
    points = response.css('span#ctl00_ContentPlaceHolder1_lblPoints::text').extract()
    address = response.css('span#ctl00_ContentPlaceHolder1_lblAddress::text').extract()
    crew = response.xpath('//td[text()="Measuring Crew: "]/following-sibling::td/text()').extract()
    nominator = response.xpath('//td[text()="Original Nominator: "]/following-sibling::td/text()').extract()
    comments = response.xpath('//td[text()="Comments: "]/following-sibling::td/text()').extract()
    gps = response.xpath('//td[text()="GPS Coordinates: "]/following-sibling::td/text()').extract()
    technique = response.css('span#ctl00_ContentPlaceHolder1_lblTech::text').extract()
    yearnominated = response.css('span#ctl00_ContentPlaceHolder1_lblNom::text').extract()
    yearlastmeasured = response.css('span#ctl00_ContentPlaceHolder1_lblMeasured::text').extract()
    item['a_county'] = response.meta['county']
    item['b_genus'] = genus
    item['c_species'] = species
    item['d_circumference'] = circumference
    item['e_spread'] = spread
    item['f_height'] = height
    item['g_points'] = points
    item['h_address'] = address
    item['i_crew'] = crew
    item['j_nominator'] = nominator
    item['k_comments'] = comments
    item['l_gps'] = gps
    item['m_technique'] = technique
    item['n_yearnominated'] = yearnominated
    item['o_yearlastmeasured'] = yearlastmeasured
    return item

The crawler works fine up through page 11. On page 12 and above, I get 500 errors. I believe it has something to do with the pagination, but I think I am sending the correct __VIEWSTATE and related fields. Here’s the output:

(python3) Al-Green:pabigtrees Tony$ scrapy crawl trees -o trees.csv
2018-04-14 15:31:18 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: pabigtrees)
2018-04-14 15:31:18 [scrapy.utils.log] INFO: Versions: lxml 4.2.1.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.4.0, w3lib 1.19.0, Twisted 17.9.0, Python 3.6.5 (v3.6.5:f59c0932b4, Mar 28 2018, 05:52:31) - [GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)], pyOpenSSL 17.5.0 (OpenSSL 1.1.0h  27 Mar 2018), cryptography 2.2.2, Platform Darwin-17.5.0-x86_64-i386-64bit
2018-04-14 15:31:18 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'pabigtrees', 'FEED_FORMAT': 'csv', 'FEED_URI': 'trees.csv', 'NEWSPIDER_MODULE': 'pabigtrees.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['pabigtrees.spiders']}
2018-04-14 15:31:18 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats']
2018-04-14 15:31:18 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-04-14 15:31:18 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-04-14 15:31:18 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2018-04-14 15:31:18 [scrapy.core.engine] INFO: Spider opened
2018-04-14 15:31:18 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-04-14 15:31:18 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-04-14 15:31:26 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://pabigtrees.com/robots.txt> (referer: None)
2018-04-14 15:31:30 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://pabigtrees.com/view_tree.aspx> (referer: None)
2018-04-14 15:31:30 [scrapy.core.engine] DEBUG: Crawled (200) <POST http://pabigtrees.com/view_tree.aspx> (referer: http://pabigtrees.com/view_tree.aspx)
2018-04-14 15:31:30 [scrapy.core.scraper] DEBUG: Scraped from <200 http://pabigtrees.com/view_tree.aspx>
{'county': 'Dauphin'}
2018-04-14 15:31:33 [scrapy.core.engine] DEBUG: Crawled (200) <POST http://pabigtrees.com/view_tree.aspx> (referer: http://pabigtrees.com/view_tree.aspx)
2018-04-14 15:31:33 [scrapy.core.scraper] DEBUG: Scraped from <200 http://pabigtrees.com/view_tree.aspx>
{'county': 'Delaware'}
2018-04-14 15:31:35 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <POST http://pabigtrees.com/view_tree.aspx> (failed 1 times): 500 Internal Server Error
2018-04-14 15:31:37 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <POST http://pabigtrees.com/view_tree.aspx> (failed 2 times): 500 Internal Server Error
2018-04-14 15:31:39 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <POST http://pabigtrees.com/view_tree.aspx> (failed 3 times): 500 Internal Server Error
2018-04-14 15:31:39 [scrapy.core.engine] DEBUG: Crawled (500) <POST http://pabigtrees.com/view_tree.aspx> (referer: http://pabigtrees.com/view_tree.aspx)
2018-04-14 15:31:39 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <500 http://pabigtrees.com/view_tree.aspx>: HTTP status code is not handled or not allowed
2018-04-14 15:31:39 [scrapy.core.engine] INFO: Closing spider (finished)
2018-04-14 15:31:39 [scrapy.extensions.feedexport] INFO: Stored csv feed (2 items) in: trees.csv
2018-04-14 15:31:39 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 134895,
 'downloader/request_count': 7,
 'downloader/request_method_count/GET': 2,
 'downloader/request_method_count/POST': 5,
 'downloader/response_bytes': 98019,
 'downloader/response_count': 7,
 'downloader/response_status_count/200': 3,
 'downloader/response_status_count/404': 1,
 'downloader/response_status_count/500': 3,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2018, 4, 14, 19, 31, 39, 475017),
 'httperror/response_ignored_count': 1,
 'httperror/response_ignored_status_count/500': 1,
 'item_scraped_count': 2,
 'log_count/DEBUG': 11,
 'log_count/INFO': 9,
 'memusage/max': 50180096,
 'memusage/startup': 50176000,
 'request_depth_max': 1,
 'response_received_count': 5,
 'retry/count': 2,
 'retry/max_reached': 1,
 'retry/reason_count/500 Internal Server Error': 2,
 'scheduler/dequeued': 6,
 'scheduler/dequeued/memory': 6,
 'scheduler/enqueued': 6,
 'scheduler/enqueued/memory': 6,
 'start_time': datetime.datetime(2018, 4, 14, 19, 31, 18, 563326)}
2018-04-14 15:31:39 [scrapy.core.engine] INFO: Spider closed (finished)

I’m stumped, thanks!

Solution

The __VIEWSTATE is indeed what is causing you trouble.

If you take a look at the pagination of the site you're trying to scrape, you'll see that it only links to 10 other pages at a time.

Those are the only 10 links of this search you're allowed to access from the current page (with the current view state). The next 10 will be accessible from page 11 of the search.

One possible solution would be to check in parse_page() if you're on page 11 (or 21, or 31...), and if so, create the requests for the next 10 pages.
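A minimal sketch of that idea, assuming the current page number is carried along in the request meta (the formdata keys mirror the spider above, but this is illustrative rather than a tested implementation):

  def parse_page(self, response):
    page = response.meta.get('page', 1)

    # ... yield the per-tree "Select$N" detail requests here, as in the spider above ...

    # Only pages 1, 11, 21, ... expose postback links to the next block of
    # pages, so further page requests are created only from those responses.
    if page % 10 == 1:
      for next_page in range(page + 1, min(page + 11, 79)):  # 78 pages in total
        yield scrapy.FormRequest.from_response(
          response,
          formdata={
            '__EVENTTARGET': 'ctl00$ContentPlaceHolder1$GridView1',
            '__EVENTARGUMENT': 'Page$' + str(next_page)
          },
          meta={'page': next_page},
          callback=self.parse_page
        )

Because each block of requests is built from a response whose view state actually offers those page links, the server accepts the postback, which is what breaks for pages 12 and above when every page is requested from page 1.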

Also, you only need to populate the formdata values you want to change; FormRequest.from_response() will take care of the ones available in hidden input fields, such as __VIEWSTATE or __EVENTVALIDATION.
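For example, the paging request in parse() could then be reduced to something like this (again just a sketch: the genus/county dropdowns keep whatever the form currently selects, and the hidden fields are filled in by from_response()):

yield scrapy.FormRequest.from_response(
  response,
  formdata={
    # from_response() copies __VIEWSTATE, __EVENTVALIDATION and the other
    # hidden inputs from the form in the response, so only the postback
    # target and argument need to be overridden here.
    '__EVENTTARGET': 'ctl00$ContentPlaceHolder1$GridView1',
    '__EVENTARGUMENT': 'Page$' + str(page)
  },
  callback=self.parse_page
)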
