How do I use Scrapy to crawl within pages?


Problem description

I am using Python and Scrapy for this question.

I am attempting to crawl webpage A, which contains a list of links to webpages B1, B2, B3, ... Each B page contains a link to another page, C1, C2, C3, ..., which contains an image.

So, using Scrapy, the idea in pseudo-code is:

links = getlinks(A)
for link in links:
    B = getpage(link)
    C = getpage(B)
    image = getimage(C)

However, I am running into a problem when trying to parse more than one page in Scrapy. Here is my code:

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    links = hxs.select('...')

    items = []
    for link in links:
        item = CustomItem()
        item['name'] = link.select('...')
        # TODO: Somehow I need to go two pages deep and extract an image.
        item['image'] = ....

How do I do this?

(Note: My question is similar to "Using multiple spiders at in the project in Scrapy", but I am unsure how to "return" values from Scrapy's Request objects.)

Recommended answer

In Scrapy, the parse method needs to return a new Request if you need to issue more requests (use yield, since Scrapy works well with generators). Inside this Request you can set a callback to the desired function (to be recursive, just pass parse again). That's the way to crawl into pages.

You can check this recursive crawler as an example.
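
A minimal sketch of that recursive pattern, written against the same old-style Scrapy API used in the rest of this answer (the spider name, start URL, and XPath here are placeholders, not from the original answer):

import urlparse

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request

class RecursiveSpider(BaseSpider):
    name = 'recursive_example'            # hypothetical spider name
    start_urls = ['http://example.com/']  # placeholder start page

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        for href in hxs.select('//a/@href').extract():
            # Passing self.parse as the callback again is what makes
            # the crawl recursive.
            yield Request(urlparse.urljoin(response.url, href),
                          callback=self.parse)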

Following your example, the change would be something like this:

def parse(self, response):
    b_pages_links = getlinks(A)  # pseudo-code helper from the question
    for link in b_pages_links:
        yield Request(link, callback=self.visit_b_page)

def visit_b_page(self, response):
    url_of_c_page = ...
    yield Request(url_of_c_page, callback=self.visit_c_page)

def visit_c_page(self, response):
    url_of_image = ...
    yield Request(url_of_image, callback=self.get_image)

def get_image(self, response):
    item = CustomItem()
    item['name'] = ...   # get image name
    item['image'] = ...  # get image data
    yield item
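
On the "return values" part of the question: a Request never returns data to the method that yielded it. Instead, you carry state forward to the next callback through the request's meta dict, which Scrapy exposes as response.meta inside that callback. A hedged sketch of the same chain with the item passed along (the XPath expressions and field assignments are placeholders, not from the original post):

from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    for link in hxs.select('//a/@href').extract():  # placeholder XPath
        item = CustomItem()
        item['name'] = link  # placeholder for whatever page A provides
        # Attach the half-built item to the request; the callback
        # reads it back from response.meta.
        yield Request(link, callback=self.visit_b_page, meta={'item': item})

def visit_b_page(self, response):
    item = response.meta['item']  # the same item, carried forward
    hxs = HtmlXPathSelector(response)
    url_of_c_page = hxs.select('//a/@href').extract()[0]  # placeholder XPath
    yield Request(url_of_c_page, callback=self.visit_c_page,
                  meta={'item': item})

def visit_c_page(self, response):
    hxs = HtmlXPathSelector(response)
    url_of_image = hxs.select('//img/@src').extract()[0]  # placeholder XPath
    yield Request(url_of_image, callback=self.get_image,
                  meta={'item': response.meta['item']})

def get_image(self, response):
    item = response.meta['item']
    item['image'] = response.body  # raw image bytes from the final response
    yield item

In newer Scrapy releases (1.7+), cb_kwargs does the same job with explicit keyword arguments, but meta is what matches the old HtmlXPathSelector-era API shown here.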

Also check the Scrapy documentation and these random code snippets. They can help a lot :)
