Scrapy works fine until page 12 of asp site, then 500 error
Problem description
My first scraping project with Python/Scrapy. Site is http://pabigtrees.com/ with 78 pages and 20 items (trees) per page. This is the full spider with a few changes to provide a minimal demonstration (scraping only one value per page):
import scrapy
from pabigtrees.items import Tree

class TreesSpider(scrapy.Spider):
    name = "trees"
    start_urls = ["http://pabigtrees.com/view_tree.aspx"]
    allowed_domains = ["pabigtrees.com"]
    download_delay = 2

    def parse(self, response):
        for page in [1, 11, 12]:
        #for page in range(1, 79):
            if page == 1:
                yield scrapy.FormRequest.from_response(
                    response,
                    #callback=self.parse_page
                    callback=self.parse_test
                )
            else:
                yield scrapy.FormRequest.from_response(
                    response,
                    formdata={
                        '__EVENTTARGET': 'ctl00$ContentPlaceHolder1$GridView1',
                        '__EVENTARGUMENT': "Page$" + str(page),
                        'ctl00$ContentPlaceHolder1$genus_latin': '0',
                        'ctl00$ContentPlaceHolder1$genus_common': '0',
                        'ctl00$ContentPlaceHolder1$county': '0',
                        '__VIEWSTATE': response.css('input#__VIEWSTATE::attr(value)').extract_first(),
                        '__VIEWSTATEGENERATOR': response.css('input#__VIEWSTATEGENERATOR::attr(value)').extract_first(),
                        '__SCROLLPOSITIONX': response.css('input#__SCROLLPOSITIONX::attr(value)').extract_first(),
                        '__SCROLLPOSITIONY': response.css('input#__SCROLLPOSITIONY::attr(value)').extract_first(),
                        '__EVENTVALIDATION': response.css('input#__EVENTVALIDATION::attr(value)').extract_first()
                    },
                    #callback=self.parse_page
                    callback=self.parse_test
                )

    def parse_test(self, response):
        yield {
            'county': response.xpath('//a[contains(@href,"Select$1")]/../../../td[5]/font/text()').extract_first()
        }

    def parse_page(self, response):
        for tree in range(0, 20):
            yield scrapy.FormRequest.from_response(
                response,
                formdata={
                    '__EVENTTARGET': 'ctl00$ContentPlaceHolder1$GridView1',
                    '__EVENTARGUMENT': "Select$" + str(tree)
                },
                # save the county from the list page because it is not available on the detail page
                meta={'county': response.xpath('//a[contains(@href,"Select$' + str(tree) + '")]/../../../td[5]/font/text()').extract_first()},
                callback=self.parse_results
            )

    def parse_results(self, response):
        item = Tree()
        genus = response.css('span#ctl00_ContentPlaceHolder1_tree_genus::text').extract()
        species = response.css('span#ctl00_ContentPlaceHolder1_tree_species::text').extract()
        circumference = response.css('span#ctl00_ContentPlaceHolder1_lblcircum::text').extract()
        spread = response.css('span#ctl00_ContentPlaceHolder1_lblSpread::text').extract()
        height = response.css('span#ctl00_ContentPlaceHolder1_lblHeight::text').extract()
        points = response.css('span#ctl00_ContentPlaceHolder1_lblPoints::text').extract()
        address = response.css('span#ctl00_ContentPlaceHolder1_lblAddress::text').extract()
        crew = response.xpath('//td[text()="Measuring Crew: "]/following-sibling::td/text()').extract()
        nominator = response.xpath('//td[text()="Original Nominator: "]/following-sibling::td/text()').extract()
        comments = response.xpath('//td[text()="Comments: "]/following-sibling::td/text()').extract()
        gps = response.xpath('//td[text()="GPS Coordinates: "]/following-sibling::td/text()').extract()
        technique = response.css('span#ctl00_ContentPlaceHolder1_lblTech::text').extract()
        yearnominated = response.css('span#ctl00_ContentPlaceHolder1_lblNom::text').extract()
        yearlastmeasured = response.css('span#ctl00_ContentPlaceHolder1_lblMeasured::text').extract()
        item['a_county'] = response.meta['county']
        item['b_genus'] = genus
        item['c_species'] = species
        item['d_circumference'] = circumference
        item['e_spread'] = spread
        item['f_height'] = height
        item['g_points'] = points
        item['h_address'] = address
        item['i_crew'] = crew
        item['j_nominator'] = nominator
        item['k_comments'] = comments
        item['l_gps'] = gps
        item['m_technique'] = technique
        item['n_yearnominated'] = yearnominated
        item['o_yearlastmeasured'] = yearlastmeasured
        return item
The crawler works fine up through page 11. On page 12 and above, I get 500 errors. I believe it has something to do with the pagination, but I think I am sending the correct VIEWSTATE etc. Here’s the output:
(python3) Al-Green:pabigtrees Tony$ scrapy crawl trees -o trees.csv
2018-04-14 15:31:18 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: pabigtrees)
2018-04-14 15:31:18 [scrapy.utils.log] INFO: Versions: lxml 4.2.1.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.4.0, w3lib 1.19.0, Twisted 17.9.0, Python 3.6.5 (v3.6.5:f59c0932b4, Mar 28 2018, 05:52:31) - [GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)], pyOpenSSL 17.5.0 (OpenSSL 1.1.0h 27 Mar 2018), cryptography 2.2.2, Platform Darwin-17.5.0-x86_64-i386-64bit
2018-04-14 15:31:18 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'pabigtrees', 'FEED_FORMAT': 'csv', 'FEED_URI': 'trees.csv', 'NEWSPIDER_MODULE': 'pabigtrees.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['pabigtrees.spiders']}
2018-04-14 15:31:18 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.logstats.LogStats']
2018-04-14 15:31:18 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-04-14 15:31:18 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-04-14 15:31:18 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2018-04-14 15:31:18 [scrapy.core.engine] INFO: Spider opened
2018-04-14 15:31:18 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-04-14 15:31:18 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-04-14 15:31:26 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://pabigtrees.com/robots.txt> (referer: None)
2018-04-14 15:31:30 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://pabigtrees.com/view_tree.aspx> (referer: None)
2018-04-14 15:31:30 [scrapy.core.engine] DEBUG: Crawled (200) <POST http://pabigtrees.com/view_tree.aspx> (referer: http://pabigtrees.com/view_tree.aspx)
2018-04-14 15:31:30 [scrapy.core.scraper] DEBUG: Scraped from <200 http://pabigtrees.com/view_tree.aspx>
{'county': 'Dauphin'}
2018-04-14 15:31:33 [scrapy.core.engine] DEBUG: Crawled (200) <POST http://pabigtrees.com/view_tree.aspx> (referer: http://pabigtrees.com/view_tree.aspx)
2018-04-14 15:31:33 [scrapy.core.scraper] DEBUG: Scraped from <200 http://pabigtrees.com/view_tree.aspx>
{'county': 'Delaware'}
2018-04-14 15:31:35 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <POST http://pabigtrees.com/view_tree.aspx> (failed 1 times): 500 Internal Server Error
2018-04-14 15:31:37 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <POST http://pabigtrees.com/view_tree.aspx> (failed 2 times): 500 Internal Server Error
2018-04-14 15:31:39 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <POST http://pabigtrees.com/view_tree.aspx> (failed 3 times): 500 Internal Server Error
2018-04-14 15:31:39 [scrapy.core.engine] DEBUG: Crawled (500) <POST http://pabigtrees.com/view_tree.aspx> (referer: http://pabigtrees.com/view_tree.aspx)
2018-04-14 15:31:39 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <500 http://pabigtrees.com/view_tree.aspx>: HTTP status code is not handled or not allowed
2018-04-14 15:31:39 [scrapy.core.engine] INFO: Closing spider (finished)
2018-04-14 15:31:39 [scrapy.extensions.feedexport] INFO: Stored csv feed (2 items) in: trees.csv
2018-04-14 15:31:39 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 134895,
'downloader/request_count': 7,
'downloader/request_method_count/GET': 2,
'downloader/request_method_count/POST': 5,
'downloader/response_bytes': 98019,
'downloader/response_count': 7,
'downloader/response_status_count/200': 3,
'downloader/response_status_count/404': 1,
'downloader/response_status_count/500': 3,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2018, 4, 14, 19, 31, 39, 475017),
'httperror/response_ignored_count': 1,
'httperror/response_ignored_status_count/500': 1,
'item_scraped_count': 2,
'log_count/DEBUG': 11,
'log_count/INFO': 9,
'memusage/max': 50180096,
'memusage/startup': 50176000,
'request_depth_max': 1,
'response_received_count': 5,
'retry/count': 2,
'retry/max_reached': 1,
'retry/reason_count/500 Internal Server Error': 2,
'scheduler/dequeued': 6,
'scheduler/dequeued/memory': 6,
'scheduler/enqueued': 6,
'scheduler/enqueued/memory': 6,
'start_time': datetime.datetime(2018, 4, 14, 19, 31, 18, 563326)}
2018-04-14 15:31:39 [scrapy.core.engine] INFO: Spider closed (finished)
I’m stumped, thanks!
The __VIEWSTATE is indeed what is causing you trouble.
If you take a look at the pager of the site you're trying to scrape, you'll see it only links to 10 other pages. Those are the only 10 links of this search you're allowed to access from the current page (with the current view state); the next 10 will be accessible from page 11 of the search.
One possible solution would be to check in parse_page() whether you're on page 11 (or 21, or 31, ...) and, if so, create the requests for the next 10 pages.
Also, you only need to populate the formdata you want to change; FormRequest.from_response() will take care of the values available in hidden input fields, such as __VIEWSTATE or __EVENTVALIDATION.
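A minimal sketch of the windowed-pagination idea (the helper name next_page_window is hypothetical, not part of Scrapy): each response only exposes pager links for the current 10-page window, so requests for pages 12-21 must be built from the page-11 response, pages 22-31 from the page-21 response, and so on.

```python
def next_page_window(current_page, last_page=78, window=10):
    """Pages whose pager links first become visible on `current_page`.

    An ASP.NET GridView pager only renders links for the current
    window, so the spider has to walk the windows: pages 2-11 are
    requested from page 1, pages 12-21 from page 11, pages 22-31
    from page 21, and so on.
    """
    if current_page % window != 1:
        return []  # not a window boundary; this response exposes no new pager links
    start = current_page + 1
    return list(range(start, min(start + window, last_page + 1)))
```

In parse_page(), after scraping the current page, you could then yield a FormRequest.from_response() for each page returned by this helper, passing only __EVENTTARGET and __EVENTARGUMENT in formdata and letting from_response() pick up the fresh __VIEWSTATE and __EVENTVALIDATION from that response.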