log_count/ERROR while scraping site with Scrapy
Question
I am getting the following log_count/ERROR while scraping a site with Scrapy. I can see that it has made 43 requests and got 43 responses. Everything looks fine, so what is the error for?
2018-03-19 00:31:30 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 18455,
'downloader/request_count': 43,
'downloader/request_method_count/GET': 43,
'downloader/response_bytes': 349500,
'downloader/response_count': 43,
'downloader/response_status_count/200': 38,
'downloader/response_status_count/301': 5,
'dupefilter/filtered': 39,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2018, 3, 18, 15, 31, 30, 227072),
'item_scraped_count': 11,
'log_count/DEBUG': 56,
'log_count/ERROR': 21,
'log_count/INFO': 8,
'memusage/max': 53444608,
'memusage/startup': 53444608,
'request_depth_max': 1,
'response_received_count': 38,
'scheduler/dequeued': 40,
'scheduler/dequeued/memory': 40,
'scheduler/enqueued': 40,
'scheduler/enqueued/memory': 40,
'spider_exceptions/AttributeError': 21,
'start_time': datetime.datetime(2018, 3, 18, 15, 31, 20, 91856)}
2018-03-19 00:31:30 [scrapy.core.engine] INFO: Spider closed (finished)
Here is my spider code:
from scrapy import Spider
from scrapy.http import Request
import re

class EventSpider(Spider):
    name = 'event'  # name of the spider
    allowed_domains = ['.....com']
    start_urls = ['http://.....com',
                  'http://.....com',
                  'http://.....com',
                  'http://.....com']

    def parse(self, response):
        events = response.xpath('//h2/a/@href').extract()
        #events = response.xpath('//a[@class = "event-overly"]').extract()
        for event in events:
            absolute_url = response.urljoin(event)
            yield Request(absolute_url, callback=self.parse_event)

    def parse_event(self, response):
        title = response.xpath('//h1/text()').extract_first()
        start_date = response.xpath('//div/p/text()')[0].extract()
        start_date_final = re.search("^[0-9]{1,2}(th|st|nd|rd)\s[A-Z][a-z]{2}\s[0-9]{4}", start_date)
        #start_date_final2 = start_date_final.group(0)
        end_date = response.xpath('//div/p/text()')[0].extract()
        end_date_final = re.search("\s[0-9]{1,2}(th|st|nd|rd)\s[A-Z][a-z]{2}\s[0-9]{4}", end_date)
        email = response.xpath('//*[@id="more-email-with-dots"]/@value').extract_first()
        email_final = re.findall("[a-zA-Z0-9_.+-]+@(?!....)[\.[a-zA-Z0-9-.]+", email)
        description = response.xpath('//*[@class = "events-discription-block"]//p//text()').extract()
        start_time = response.xpath('//div/p/text()')[1].extract()
        venue = response.xpath('//*[@id ="more-text-with-dots"]/@value').extract()
        yield {
            'title': title,
            'start_date': start_date_final.group(0),
            'end_date': end_date_final.group(0),
            'start_time': start_time,
            'venue': venue,
            'email': email_final,
            'description': description,
        }
I am absolutely new to the world of scraping. How do I overcome this error?
Answer
That output shows that 21 errors were logged; you can also see that all of those were AttributeErrors.
If you look at the rest of the log output, you will see the errors themselves:
Traceback (most recent call last):
(...)
'end_date': end_date_final.group(0),
AttributeError: 'NoneType' object has no attribute 'group'
From this, you can see that your regex for end_date_final doesn't always find a match: re.search returns None when nothing matches, and calling .group(0) on None raises the AttributeError.
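One common way to handle this (a sketch of the general pattern, not necessarily the exact fix for this site) is to check the result of re.search before calling .group(). The helper name extract_end_date and the sample strings below are illustrative assumptions; the regex is the one from the spider:

```python
import re

# The end-date pattern from the spider above, e.g. " 21st Mar 2018"
DATE_RE = r"\s[0-9]{1,2}(th|st|nd|rd)\s[A-Z][a-z]{2}\s[0-9]{4}"

def extract_end_date(text):
    """Return the matched date string, or None when the regex finds nothing."""
    match = re.search(DATE_RE, text)
    return match.group(0) if match else None

# A page whose text matches the pattern, and one that does not:
print(extract_end_date("Ends 21st Mar 2018"))  # " 21st Mar 2018"
print(extract_end_date("Date TBA"))            # None
```

Yielding None (or skipping the item) for pages without a parseable date keeps the spider running instead of raising 21 AttributeErrors.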