How do I use Scrapy to crawl within pages?
Question
I am using Python and Scrapy for this question.
I am attempting to crawl webpage A, which contains a list of links to webpages B1, B2, B3, ... Each B page contains a link to another page, C1, C2, C3, ..., which contains an image.
So, using Scrapy, the idea in pseudo-code is:
```
links = getlinks(A)
for link in links:
    B = getpage(link)
    C = getpage(B)
    image = getimage(C)
```
However, I am running into a problem when trying to parse more than one page in Scrapy. Here is my code:
```python
from scrapy.selector import HtmlXPathSelector

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    links = hxs.select('...')
    items = []
    for link in links:
        item = CustomItem()
        item['name'] = link.select('...')
        # TODO: Somehow I need to go two pages deep and extract an image.
        item['image'] = ....
```
How do I do this?
(Note: My question is similar to Using multiple spiders at in the project in Scrapy but I am unsure how to "return" values from Scrapy's Request objects.)
Answer
In Scrapy, the parse method needs to yield a new Request when you need to issue more requests (use yield, since Scrapy works well with generators). On that request you can set a callback to the desired function (to be recursive, just pass parse again). That's the way to crawl into pages.
You can check this recursive crawler as an example.
Following your example, the change would be something like this:
```python
def parse(self, response):
    b_pages_links = getlinks(A)
    for link in b_pages_links:
        yield Request(link, callback=self.visit_b_page)

def visit_b_page(self, response):
    url_of_c_page = ...
    yield Request(url_of_c_page, callback=self.visit_c_page)

def visit_c_page(self, response):
    url_of_image = ...
    yield Request(url_of_image, callback=self.get_image)

def get_image(self, response):
    item = CustomItem()
    item['name'] = ...   # get image name
    item['image'] = ...  # get image data
    yield item
```
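One piece the callbacks above leave out is how to carry data (like item['name']) from an early callback to the final one, which is the "return values from Request objects" part of the question. In real Scrapy you attach values to Request.meta (or, in newer versions, cb_kwargs) and read them back from response.meta in the next callback. Below is a Scrapy-free sketch of that pattern: the Request/Response classes and the tiny engine loop are simplified stand-ins, not Scrapy's real implementation, and the URLs and field names are made up for illustration.

```python
class Request:
    """Stand-in for scrapy.Request: a URL, a callback, and a meta dict."""
    def __init__(self, url, callback, meta=None):
        self.url = url
        self.callback = callback
        self.meta = meta or {}

class Response:
    """Stand-in for scrapy's Response; Scrapy copies request.meta onto it."""
    def __init__(self, request):
        self.url = request.url
        self.meta = request.meta

def parse(response):
    # Page A: pretend we scraped one B-page link plus the item's name,
    # and stash the name in meta so later callbacks can see it.
    yield Request('http://example.com/b1', callback=visit_b_page,
                  meta={'name': 'picture-1'})

def visit_b_page(response):
    # Page B: follow the link to the C page, forwarding the accumulated meta.
    yield Request('http://example.com/c1', callback=visit_c_page,
                  meta=response.meta)

def visit_c_page(response):
    # Page C: everything is in hand; emit the final item (a plain dict here).
    yield {'name': response.meta['name'], 'image_url': response.url}

def run(start_requests):
    """Minimal stand-in for the engine: feed each Request a fake Response."""
    items = []
    queue = list(start_requests)
    while queue:
        request = queue.pop(0)
        for result in request.callback(Response(request)):
            if isinstance(result, Request):
                queue.append(result)   # schedule the follow-up request
            else:
                items.append(result)   # anything else is a scraped item
    return items

items = run([Request('http://example.com/a', callback=parse)])
# items == [{'name': 'picture-1', 'image_url': 'http://example.com/c1'}]
```

In actual Scrapy code the loop in run() is the engine's job; you only write the callbacks and yield Request(..., meta={...}) or yield the finished item.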
Also check the scrapy documentation and these random code snippets. They can help a lot :)