log_count/ERROR while scraping site with Scrapy

Problem Description

I am getting the following log_count/ERROR while scraping a site with Scrapy. I can see that it has made 43 requests and got 43 responses. Everything looks fine, so what is the error for?

2018-03-19 00:31:30 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 18455,
 'downloader/request_count': 43,
 'downloader/request_method_count/GET': 43,
 'downloader/response_bytes': 349500,
 'downloader/response_count': 43,
 'downloader/response_status_count/200': 38,
 'downloader/response_status_count/301': 5,
 'dupefilter/filtered': 39,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2018, 3, 18, 15, 31, 30, 227072),
 'item_scraped_count': 11,
 'log_count/DEBUG': 56,
 'log_count/ERROR': 21,
 'log_count/INFO': 8,
 'memusage/max': 53444608,
 'memusage/startup': 53444608,
 'request_depth_max': 1,
 'response_received_count': 38,
 'scheduler/dequeued': 40,
 'scheduler/dequeued/memory': 40,
 'scheduler/enqueued': 40,
 'scheduler/enqueued/memory': 40,
 'spider_exceptions/AttributeError': 21,
 'start_time': datetime.datetime(2018, 3, 18, 15, 31, 20, 91856)}
2018-03-19 00:31:30 [scrapy.core.engine] INFO: Spider closed (finished)

Here is my spider code:

from scrapy import Spider
from scrapy.http import Request
import re

class EventSpider(Spider):
    name = 'event' #name of the spider
    allowed_domains = ['.....com']
    start_urls = ['http://.....com',
                  'http://.....com',
                  'http://.....com',
                  'http://.....com',]

    def parse(self, response):
        events = response.xpath('//h2/a/@href').extract()
        #events = response.xpath('//a[@class = "event-overly"]').extract()

        for event in events:
            absolute_url = response.urljoin(event)
            yield Request(absolute_url, callback=self.parse_event)

    def parse_event(self, response):
        title = response.xpath('//h1/text()').extract_first()
        start_date = response.xpath('//div/p/text()')[0].extract()
        start_date_final = re.search(r"^[0-9]{1,2}(th|st|nd|rd)\s[A-Z][a-z]{2}\s[0-9]{4}", start_date)
        #start_date_final2 = start_date_final.group(0)
        end_date = response.xpath('//div/p/text()')[0].extract()
        end_date_final = re.search(r"\s[0-9]{1,2}(th|st|nd|rd)\s[A-Z][a-z]{2}\s[0-9]{4}", end_date)
        email = response.xpath('//*[@id="more-email-with-dots"]/@value').extract_first()
        email_final = re.findall(r"[a-zA-Z0-9_.+-]+@(?!....)[\.[a-zA-Z0-9-.]+", email)
        description = response.xpath('//*[@class = "events-discription-block"]//p//text()').extract()
        start_time = response.xpath('//div/p/text()')[1].extract()
        venue = response.xpath('//*[@id ="more-text-with-dots"]/@value').extract()
        yield {
            'title': title,
            'start_date': start_date_final.group(0),
            'end_date': end_date_final.group(0),
            'start_time': start_time,
            'venue': venue,
            'email': email_final,
            'description': description
        }

I am absolutely new to the scraping world. How do I overcome this error?

Recommended Answer

That output shows that 21 errors were logged; you can also see that all of them were AttributeErrors.

If you look at the rest of the log output, you will see the errors themselves:

Traceback (most recent call last):
  (...)
    'end_date': end_date_final.group(0),
AttributeError: 'NoneType' object has no attribute 'group'

From this, you can see that your regex for end_date_final doesn't always find a match.
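
One way to avoid the AttributeError is to check whether re.search actually found something before calling .group(0). Here is a minimal sketch of that guard, using a hypothetical get_date helper and the same end-date pattern as in the spider:

import re

# Same end-date pattern used in parse_event.
END_DATE_PATTERN = r"\s[0-9]{1,2}(th|st|nd|rd)\s[A-Z][a-z]{2}\s[0-9]{4}"

def get_date(text):
    # Hypothetical helper: re.search returns a Match object on success
    # and None otherwise, so only call .group(0) when a match exists.
    match = re.search(END_DATE_PATTERN, text)
    return match.group(0) if match else None

In parse_event, yielding 'end_date': get_date(end_date) would then give None for pages where the pattern is missing, instead of raising AttributeError and killing the item.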
