CPU-intensive parsing with scrapy


Problem Description

The CONCURRENT_ITEMS section of http://doc.scrapy.org/en/latest/topics/settings.html#concurrent-items defines it as:

Maximum number of concurrent items (per response) to process in parallel in the Item Processor (also known as the Item Pipeline).

This confuses me. Does this imply that the items sent to the pipeline are processed in parallel, i.e. really multiprocessed?
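For context, CONCURRENT_ITEMS is just an entry in the project's settings.py; a minimal sketch (the value here is illustrative, not a recommendation):

```python
# settings.py (sketch; value is illustrative)

# Maximum number of items (per response) processed in parallel
# in the item pipeline. Scrapy's default is 100.
CONCURRENT_ITEMS = 50
```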

Suppose my parsing involves a lot of lxml querying and xpath'ing. Should I do them in the spider's parse method itself, or should I send an Item with the whole response in it and let custom pipeline classes populate the Item's fields by parsing the response body?
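To make that second option concrete, here is a minimal sketch of what it would look like; RawPageItem, ParsePipeline, and the title field are hypothetical names invented for illustration:

```python
import scrapy
from lxml import html


class RawPageItem(scrapy.Item):
    # Hypothetical item that carries the unparsed response body.
    body = scrapy.Field()
    title = scrapy.Field()


# In the spider, parse() would only hand over the raw body:
#     def parse(self, response):
#         yield RawPageItem(body=response.text)


class ParsePipeline:
    """Hypothetical pipeline that does the heavy lxml work instead of parse()."""

    def process_item(self, item, spider):
        tree = html.fromstring(item["body"])
        item["title"] = tree.xpath("//title/text()")  # the expensive xpath'ing happens here
        return item
```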

Recommended Answer

The Requests system also works in parallel; see http://doc.scrapy.org/en/latest/topics/settings.html#concurrent-requests. Scrapy is designed to handle requesting and parsing in the spider itself: the callback methods make it asynchronous, and by default multiple Requests do indeed run in parallel.
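Like CONCURRENT_ITEMS, that concurrency is tuned in settings.py; a sketch with illustrative values:

```python
# settings.py (sketch; values are illustrative)

# How many Requests Scrapy keeps in flight at once (default: 16).
CONCURRENT_REQUESTS = 32

# Optional per-domain cap on top of the global limit (default: 8).
CONCURRENT_REQUESTS_PER_DOMAIN = 8
```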

The item pipeline, which does process items in parallel, isn't intended for heavy parsing: it is rather meant to check and validate the values you got in each item (http://doc.scrapy.org/en/latest/topics/item-pipeline.html).
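A minimal sketch of the kind of lightweight pipeline the docs have in mind; the price field and the check itself are made-up examples:

```python
from scrapy.exceptions import DropItem


class ValidatePricePipeline:
    """Cheap per-item checks and normalization, not heavy parsing."""

    def process_item(self, item, spider):
        # Validate a value the spider has already extracted.
        if not item.get("price"):
            raise DropItem("missing price in %s" % item)
        item["price"] = float(item["price"])  # normalize the type
        return item
```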

Therefore you should do your queries in the spider itself, as they are designed to live there. From the docs on spiders:

Spiders are classes which define how a certain site (or group of sites) will be scraped, including how to perform the crawl (i.e. follow links) and how to extract structured data from their pages (i.e. scraping items).

