Parse callback is not defined - Simple Webscraper (Scrapy) still not running
Question
I googled for half a day and still can't get it going. Maybe you have some insights?
I tried to start my scraper from a script rather than from a terminal. This works fine without rules, simply yielding items from the normal parse function.
As soon as I use rules and change callback="parse" to callback="parse_item", nothing works anymore.
I tried creating a crawler based on yielding requests in my parse function. The result: I only scraped a single URL, not the whole domain.
Having rules seems to be the way to go, so I actually want this version to run rather than working with yields in the parse function.
import scrapy
from scrapy.crawler import CrawlerProcess
from bs4 import BeautifulSoup
from scrapy.http import Request
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


def beauty(response_dictionary):
    html_response = response_dictionary["html"]
    print(response_dictionary["url"])
    for html in html_response:
        soup = BeautifulSoup(html, 'lxml')
        metatag = soup.find_all("meta")
        print(metatag)


class MySpider(scrapy.Spider):
    name = "MySpidername"
    allowed_domains = ["www.bueffeln.net"]
    start_urls = ['https://www.bueffeln.net']
    rules = [Rule(LinkExtractor(allow=()), callback='parse_item', follow=True),]

    def parse_item(self, response):
        url_dictionary = {}
        print(response.status)
        url_dictionary["url"] = response.url
        print(response.headers)
        url_dictionary["html"] = response.xpath('//html').getall()
        beauty(url_dictionary)


process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})
process.crawl(MySpider)
process.start()
The error seems to be the following:
2019-11-18 18:14:56 [scrapy.utils.log] INFO: Scrapy 1.7.4 started (bot: scrapybot)
2019-11-18 18:14:56 [scrapy.utils.log] INFO: Versions: lxml 4.4.1.0, libxml2 2.9.5, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 19.7.0, Python 3.7.4 (tags/v3.7.4:e09359112e, Jul 8 2019, 19:29:22) [MSC v.1916 32 bit (Intel)], pyOpenSSL 19.0.0 (OpenSSL 1.1.1d 10 Sep 2019), cryptography 2.8, Platform Windows-10-10.0.18362-SP0
2019-11-18 18:14:56 [scrapy.crawler] INFO: Overridden settings: {'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'}
2019-11-18 18:14:56 [scrapy.extensions.telnet] INFO: Telnet Password: 970cca12e7c43d67
2019-11-18 18:14:56 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2019-11-18 18:14:57 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-11-18 18:14:57 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-11-18 18:14:57 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-11-18 18:14:57 [scrapy.core.engine] INFO: Spider opened
2019-11-18 18:14:57 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-11-18 18:14:57 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2019-11-18 18:14:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.bueffeln.net> (referer: None)
2019-11-18 18:14:57 [scrapy.core.scraper] ERROR: Spider error processing <GET https://www.bueffeln.net> (referer: None)
Traceback (most recent call last):
File "C:\Users\msi\PycharmProjects\test\venv\lib\site-packages\twisted\internet\defer.py", line 654, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "C:\Users\msi\PycharmProjects\test\venv\lib\site-packages\scrapy\spiders\__init__.py", line 80, in parse
raise NotImplementedError('{}.parse callback is not defined'.format(self.__class__.__name__))
NotImplementedError: MySpider.parse callback is not defined
2019-11-18 18:14:57 [scrapy.core.engine] INFO: Closing spider (finished)
2019-11-18 18:14:57 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 231,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 16695,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'elapsed_time_seconds': 0.435081,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2019, 11, 18, 17, 14, 57, 454733),
'log_count/DEBUG': 1,
'log_count/ERROR': 1,
'log_count/INFO': 10,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'spider_exceptions/NotImplementedError': 1,
'start_time': datetime.datetime(2019, 11, 18, 17, 14, 57, 19652)}
2019-11-18 18:14:57 [scrapy.core.engine] INFO: Spider closed (finished)
Process finished with exit code 0
Answer
Scrapy uses the parse callback to parse the URLs from start_urls. You didn't provide such a callback, which is why Scrapy can't process your https://www.bueffeln.net URL.
If you want your code to run as written, you need to add a parse callback (even an empty one). Your rules, however, are only honored by the CrawlSpider machinery, not by a plain Spider with a parse method.
UPDATE

To use rules you need scrapy.spiders.CrawlSpider (your code already imports it as CrawlSpider):
class MySpider(CrawlSpider):