Parse callback is not defined - Simple Webscraper (Scrapy) still not running


Problem Description

I googled for half a day and still can't get it going. Maybe you have some insights?

I tried to start my scraper from a script rather than from a terminal. This works fine without rules, just yielding from the normal parse function.

As soon as I use rules and change callback="parse" to callback="parse_item", nothing works anymore.

I tried creating a crawler based on yielding requests in my parse function. The result: I only scraped a single URL, not the whole domain. (A reconstruction of that approach is sketched below.)
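For reference, the request-yielding approach described above might have looked roughly like this. This is a hypothetical reconstruction, not the questioner's actual code; the CSS selector and callback name are assumptions:

def parse(self, response):
    # Hypothetical reconstruction of the request-yielding approach.
    yield {"url": response.url}
    # Follow every link found on the page. If requests like these are
    # never yielded, only the start URL gets scraped.
    for href in response.css("a::attr(href)").getall():
        yield response.follow(href, callback=self.parse)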

Having rules seems to be the way to go, so I actually want this version to run instead of working with yields in the parse function.

import scrapy

from scrapy.crawler import CrawlerProcess
from bs4 import BeautifulSoup
from scrapy.http import Request
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


def beauty(response_dictionary):
    html_response = response_dictionary["html"]
    print(response_dictionary["url"])
    for html in html_response:
        soup = BeautifulSoup(html, 'lxml')
        metatag = soup.find_all("meta")
        print(metatag)

class MySpider(scrapy.Spider):
    name = "MySpidername"
    allowed_domains = ["www.bueffeln.net"]
    start_urls = ['https://www.bueffeln.net']

    rules = [Rule(LinkExtractor(allow=()), callback='parse_item', follow=True),]

    def parse_item(self, response):
        url_dictionary = {}
        print(response.status)
        url_dictionary["url"] = response.url
        print(response.headers)
        url_dictionary["html"] = response.xpath('//html').getall()
        beauty(url_dictionary)


process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

process.crawl(MySpider)
process.start() 

The error seems to be the following:

2019-11-18 18:14:56 [scrapy.utils.log] INFO: Scrapy 1.7.4 started (bot: scrapybot)
2019-11-18 18:14:56 [scrapy.utils.log] INFO: Versions: lxml 4.4.1.0, libxml2 2.9.5, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 19.7.0, Python 3.7.4 (tags/v3.7.4:e09359112e, Jul  8 2019, 19:29:22) [MSC v.1916 32 bit (Intel)], pyOpenSSL 19.0.0 (OpenSSL 1.1.1d  10 Sep 2019), cryptography 2.8, Platform Windows-10-10.0.18362-SP0
2019-11-18 18:14:56 [scrapy.crawler] INFO: Overridden settings: {'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'}
2019-11-18 18:14:56 [scrapy.extensions.telnet] INFO: Telnet Password: 970cca12e7c43d67
2019-11-18 18:14:56 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2019-11-18 18:14:57 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-11-18 18:14:57 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-11-18 18:14:57 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-11-18 18:14:57 [scrapy.core.engine] INFO: Spider opened
2019-11-18 18:14:57 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-11-18 18:14:57 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2019-11-18 18:14:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.bueffeln.net> (referer: None)
2019-11-18 18:14:57 [scrapy.core.scraper] ERROR: Spider error processing <GET https://www.bueffeln.net> (referer: None)
Traceback (most recent call last):
  File "C:\Users\msi\PycharmProjects\test\venv\lib\site-packages\twisted\internet\defer.py", line 654, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "C:\Users\msi\PycharmProjects\test\venv\lib\site-packages\scrapy\spiders\__init__.py", line 80, in parse
    raise NotImplementedError('{}.parse callback is not defined'.format(self.__class__.__name__))
NotImplementedError: MySpider.parse callback is not defined
2019-11-18 18:14:57 [scrapy.core.engine] INFO: Closing spider (finished)
2019-11-18 18:14:57 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 231,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 16695,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'elapsed_time_seconds': 0.435081,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2019, 11, 18, 17, 14, 57, 454733),
 'log_count/DEBUG': 1,
 'log_count/ERROR': 1,
 'log_count/INFO': 10,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'spider_exceptions/NotImplementedError': 1,
 'start_time': datetime.datetime(2019, 11, 18, 17, 14, 57, 19652)}
2019-11-18 18:14:57 [scrapy.core.engine] INFO: Spider closed (finished)

Process finished with exit code 0

Recommended Answer

Scrapy uses the parse callback to parse URLs from start_urls. You didn't provide such a callback, which is why Scrapy can't process your https://www.bueffeln.net URL.

If you want your code to work, you need to add a parse callback (even an empty one); a minimal placeholder is sketched below. Your rules will be applied after the parse callback.
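A minimal placeholder of the kind the answer describes, assuming it is added inside the MySpider class from the question:

def parse(self, response):
    # An empty parse keeps Scrapy's default implementation from raising
    # NotImplementedError for responses to start_urls.
    pass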

UPDATE: To use rules you need Scrapy's CrawlSpider (already imported in the question's code via from scrapy.spiders import CrawlSpider):

class MySpider(CrawlSpider):
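Putting the pieces together, a corrected version of the question's spider might look like the sketch below. This is a minimal sketch under the assumptions above, not a verified fix; parse_item is reduced to yielding a small dict for brevity:

from scrapy.crawler import CrawlerProcess
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class MySpider(CrawlSpider):
    name = "MySpidername"
    allowed_domains = ["www.bueffeln.net"]
    start_urls = ['https://www.bueffeln.net']

    # CrawlSpider reads these rules and follows the extracted links.
    # Do not override parse() here: CrawlSpider uses it internally to
    # drive the rules, which is why the callback is named parse_item.
    rules = [Rule(LinkExtractor(allow=()), callback='parse_item', follow=True)]

    def parse_item(self, response):
        yield {"url": response.url, "status": response.status}


process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})
process.crawl(MySpider)
process.start()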

