How do I use Scrapy to crawl within pages?
Question
I am using Python and Scrapy for this question.
I am attempting to crawl webpage A, which contains a list of links to webpages B1, B2, B3, ... Each B page contains a link to another page, C1, C2, C3, ..., which contains an image.
So, using Scrapy, the idea in pseudo-code is:
```
links = getlinks(A)
for link in links:
    B = getpage(link)
    C = getpage(B)
    image = getimage(C)
```
However, I am running into a problem when trying to parse more than one page in Scrapy. Here is my code:
```python
from scrapy.selector import HtmlXPathSelector

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    links = hxs.select('...')
    items = []
    for link in links:
        item = CustomItem()
        item['name'] = link.select('...')
        # TODO: Somehow I need to go two pages deep and extract an image.
        item['image'] = ....
```
How do I do this?
(Note: My question is similar to Using multiple spiders at in the project in Scrapy but I am unsure how to "return" values from Scrapy's Request objects.)
Answer
In Scrapy, the parse method needs to yield a new Request when you need to issue more requests (use yield, since Scrapy works well with generators). On that request you can set a callback to the desired function (to be recursive, just pass parse again). That's the way to crawl into pages.
You can check this recursive crawler as an example.
Following your example, the change would be something like this:
```python
def parse(self, response):
    b_pages_links = getlinks(A)
    for link in b_pages_links:
        yield Request(link, callback=self.visit_b_page)

def visit_b_page(self, response):
    url_of_c_page = ...
    yield Request(url_of_c_page, callback=self.visit_c_page)

def visit_c_page(self, response):
    url_of_image = ...
    yield Request(url_of_image, callback=self.get_image)

def get_image(self, response):
    item = CustomItem()
    item['name'] = ...   # get image name
    item['image'] = ...  # get image data
    yield item
```
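One piece the callbacks above leave out is how to carry data (like item['name']) from an early callback to the final one, which is the "return values from Request objects" part of the question. In real Scrapy you attach values to Request.meta (or, in newer versions, cb_kwargs) and read them back from response.meta in the next callback. Below is a Scrapy-free sketch of that pattern: the Request/Response classes and the tiny engine loop are simplified stand-ins, not Scrapy's real implementation, and the URLs and field names are made up for illustration.

```python
class Request:
    """Stand-in for scrapy.Request: a URL, a callback, and a meta dict."""
    def __init__(self, url, callback, meta=None):
        self.url = url
        self.callback = callback
        self.meta = meta or {}

class Response:
    """Stand-in for scrapy's Response; Scrapy copies request.meta onto it."""
    def __init__(self, request):
        self.url = request.url
        self.meta = request.meta

def parse(response):
    # Page A: pretend we scraped one B-page link plus the item's name,
    # and stash the name in meta so later callbacks can see it.
    yield Request('http://example.com/b1', callback=visit_b_page,
                  meta={'name': 'picture-1'})

def visit_b_page(response):
    # Page B: follow the link to the C page, forwarding the accumulated meta.
    yield Request('http://example.com/c1', callback=visit_c_page,
                  meta=response.meta)

def visit_c_page(response):
    # Page C: everything is in hand; emit the final item (a plain dict here).
    yield {'name': response.meta['name'], 'image_url': response.url}

def run(start_requests):
    """Minimal stand-in for the engine: feed each Request a fake Response."""
    items = []
    queue = list(start_requests)
    while queue:
        request = queue.pop(0)
        for result in request.callback(Response(request)):
            if isinstance(result, Request):
                queue.append(result)   # schedule the follow-up request
            else:
                items.append(result)   # anything else is a scraped item
    return items

items = run([Request('http://example.com/a', callback=parse)])
# items == [{'name': 'picture-1', 'image_url': 'http://example.com/c1'}]
```

In actual Scrapy code the loop in run() is the engine's job; you only write the callbacks and yield Request(..., meta={...}) or yield the finished item.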
Also check the scrapy documentation and these random code snippets. They can help a lot :)