Scrapy CrawlSpider rules with multiple callbacks

Question
I'm trying to create an ExampleSpider which implements scrapy CrawlSpider. My ExampleSpider should be able to process pages containing only artist info, pages containing only album info, and some other pages which contain both album and artist info.
I was able to handle the first two scenarios, but the problem occurs in the third. I'm using the parse_artist(response) method to process artist data and the parse_album(response) method to process album data. My question is: if a page contains both artist and album data, how should I define my rules?
- Should I do it like below? (two rules for the same URL pattern)
- Should I use multiple callbacks? (does Scrapy support multiple callbacks?)
- Is there another way to do it? (a proper way)
class ExampleSpider(CrawlSpider):
    name = 'example'
    start_urls = ['http://www.example.com']

    rules = [
        Rule(SgmlLinkExtractor(allow=[r'same regex_rule']), callback='parse_artist', follow=True),
        Rule(SgmlLinkExtractor(allow=[r'same regex_rule']), callback='parse_album', follow=True),
        # more rules .....
    ]

    def parse_artist(self, response):
        artist_item = ArtistItem()
        try:
            # do the scrape and assign to ArtistItem
            pass
        except Exception:
            # ignore for now
            pass
        return artist_item

    def parse_album(self, response):
        album_item = AlbumItem()
        try:
            # do the scrape and assign to AlbumItem
            pass
        except Exception:
            # ignore for now
            pass
        return album_item
Answer

CrawlSpider calls the _requests_to_follow() method to extract URLs and generate requests to follow:
def _requests_to_follow(self, response):
    if not isinstance(response, HtmlResponse):
        return
    seen = set()
    for n, rule in enumerate(self._rules):
        links = [l for l in rule.link_extractor.extract_links(response) if l not in seen]
        if links and rule.process_links:
            links = rule.process_links(links)
        seen = seen.union(links)
        for link in links:
            r = Request(url=link.url, callback=self._response_downloaded)
            r.meta.update(rule=n, link_text=link.text)
            yield rule.process_request(r)
As you can see:

- The variable seen memorizes the urls that have already been processed.
- Every url will be parsed by at most one callback.
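A minimal, self-contained sketch of that deduplication behavior, using simplified stand-ins for rules and link extraction (not the real Scrapy classes), shows why a second rule with the same pattern never fires:

```python
# Simplified illustration of the seen-set logic in _requests_to_follow.
# The (callback, pattern) pairs below are hypothetical stand-ins for Rule objects.

def requests_to_follow(rules, links_on_page):
    """Each rule 'extracts' the links matching its pattern; a link already
    in `seen` is skipped by later rules, so each URL is scheduled with at
    most one callback."""
    seen = set()
    scheduled = []
    for callback, pattern in rules:
        links = [l for l in links_on_page if pattern in l and l not in seen]
        seen = seen.union(links)
        for link in links:
            scheduled.append((link, callback))
    return scheduled

# Two rules with the same pattern, mirroring the question:
rules = [('parse_artist', '/page/'), ('parse_album', '/page/')]
result = requests_to_follow(rules, ['http://example.com/page/1'])
# Only the first rule (parse_artist) claims the link; parse_album gets nothing.
print(result)
```

This is exactly why defining two Rule entries with the same regex does not work: the first rule consumes the URL before the second ever sees it.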
You can define a parse_item() to call parse_artist() and parse_album():
rules = [
    Rule(SgmlLinkExtractor(allow=[r'same regex_rule']), callback='parse_item', follow=True),
    # more rules .....
]

def parse_item(self, response):
    yield self.parse_artist(response)
    yield self.parse_album(response)
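The dispatcher pattern can be checked in isolation. The sketch below uses hypothetical stand-ins (a FakeResponse class and plain item dicts, no Scrapy dependency) to show that a single callback yields both items from one response:

```python
# Minimal stand-ins so the parse_item dispatch pattern can run without Scrapy.

class FakeResponse:
    def __init__(self, url):
        self.url = url

def parse_artist(response):
    # Would scrape artist fields here; returns a hypothetical item dict.
    return {'type': 'artist', 'url': response.url}

def parse_album(response):
    # Would scrape album fields here; returns a hypothetical item dict.
    return {'type': 'album', 'url': response.url}

def parse_item(response):
    # The single callback registered in the rule; fans out to both parsers.
    yield parse_artist(response)
    yield parse_album(response)

items = list(parse_item(FakeResponse('http://www.example.com/both')))
# items holds one artist item and one album item for the same page.
```

Inside a real spider these would be methods yielding ArtistItem and AlbumItem instances, but the control flow is the same: one rule, one callback, two items per matching page.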