Scraping multiple pages, staying independent


Problem description

I want to scrape a bunch of pages, feeding different data pots that are then matched later on.

[Page1]-Get-PostProcessing-Store-[Pot1]-+
[Page2]-Get-PostProcessing-Store-[Pot2]-+--Match---[ResultPage]-REST-API
[Page3]-Get-PostProcessing-Store-[Pot3]-+
...

Now I want to be as independent as possible considering the pipeline for each page. Sometimes pages will need JavaScript scraping capabilities, sometimes not. Sometimes I need to also grab images, sometimes only PDFs.

I did a prototype with one page and Scrapy. I do have a structure, but I don't know how to "split" it up so that the scraper and middleware are independent for each page. On the other hand, is lxml enough? How do I handle robots and wait delays to avoid blocking? Does it make sense to add a message queue?

What is the best way to implement all this? Please be specific! My major problems are the structure for organizing my code and the tools to use.

Answer

Whoa, lots of questions there. =)

Hard to be specific for such a broad question, especially without knowing how familiar you are with the tool.

If I understood correctly, you have a spider and a middleware. I didn't get exactly what your middleware code is doing, but for a proof of concept I'd start with all the code in one spider (and perhaps some util functions), leaving you free to use different callbacks for the different extraction techniques.

Once you have that working, then you can look into making a generic middleware if needed (premature abstraction is often just as bad as premature optimization).

Here are some thoughts:

If you know beforehand which code you want to call for handling each request, just set the appropriate callback for that request:

# spider methods (requires "import scrapy" at the top of the module)
def parse(self, response):
    # route each request to the callback that knows how to handle it
    yield scrapy.Request('http://example.com/file.pdf', self.handle_pdf)
    yield scrapy.Request('http://example.com/next_page', self.handle_next_page)

def handle_pdf(self, response):
    """process the response for a PDF request"""

def handle_next_page(self, response):
    """process the response for the next page"""

If you don't know beforehand, you can implement a callback that dispatches to the other appropriate callbacks accordingly:

def parse(self, response):
    # inspect the response and dispatch to whichever handlers apply
    if self.should_grab_images(response):
        for it in self.grab_images(response):
            yield it
    if self.should_follow_links(response):
        for it in self.follow_links(response):
            yield it
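
The should_grab_images/grab_images-style helpers above are not Scrapy APIs; they are just methods you would write yourself on the spider. A minimal sketch of what they might look like (the selectors and item shape are assumptions for illustration, and `import scrapy` is assumed as before):

def should_grab_images(self, response):
    # hypothetical predicate: only bother if the page actually has images
    return bool(response.xpath('//img/@src'))

def grab_images(self, response):
    # hypothetical helper: yield one item per image URL found on the page
    for src in response.xpath('//img/@src').getall():
        yield {'type': 'image', 'url': response.urljoin(src)}

def should_follow_links(self, response):
    # hypothetical predicate: follow pagination only when a "next" link exists
    return bool(response.css('a.next::attr(href)'))

def follow_links(self, response):
    # hypothetical helper: send follow-up pages back through parse()
    for href in response.css('a.next::attr(href)').getall():
        yield scrapy.Request(response.urljoin(href), callback=self.parse)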

Is lxml enough?

Probably. But it's a good idea to learn XPath, if you haven't already, to take full advantage of it; that's a good place to start.
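
For a rough idea of what standalone lxml extraction looks like (the URL and XPath expressions below are placeholders; Scrapy's own selectors are built on top of lxml as well):

import requests
import lxml.html

# fetch and parse a page into an lxml HTML tree (URL is a placeholder)
html = requests.get('http://example.com').text
tree = lxml.html.fromstring(html)

# query the tree with XPath
titles = tree.xpath('//h1/text()')
pdf_links = tree.xpath('//a[contains(@href, ".pdf")]/@href')
print(titles, pdf_links)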

Unless you need to execute JavaScript code; in that case you might want to try plugging in Selenium/PhantomJS or Splash.
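
If you go the Splash route, the scrapy-splash package is the usual way to plug it in. A minimal sketch, assuming a Splash instance running at localhost:8050 and the standard scrapy-splash middleware wiring in settings.py:

from scrapy_splash import SplashRequest

# settings.py (abridged scrapy-splash wiring):
# SPLASH_URL = 'http://localhost:8050'
# DOWNLOADER_MIDDLEWARES = {
#     'scrapy_splash.SplashCookiesMiddleware': 723,
#     'scrapy_splash.SplashMiddleware': 725,
#     'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
# }
# DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

def parse(self, response):
    # render JavaScript-heavy pages through Splash, plain Requests elsewhere
    yield SplashRequest('http://example.com/js_page', self.handle_js_page,
                        args={'wait': 0.5})

def handle_js_page(self, response):
    # the response here contains the DOM as rendered after JavaScript ran
    ...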

If you don't need to execute Javascript code, but need to parse data that is inside JS code, you can use js2xml.
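
A rough sketch of that approach (the variable name and the shape of the embedded JavaScript are made up for illustration):

import js2xml

# JavaScript as it might appear inside a <script> tag
js_code = 'var config = {"title": "Example", "pages": 3};'

# js2xml parses the JS into an lxml tree that you can query with XPath
parsed = js2xml.parse(js_code)
title = parsed.xpath('//property[@name="title"]/string/text()')
print(title)  # expected: ['Example']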

To obey robots.txt, set ROBOTSTXT_OBEY to True.

To configure a delay, set DOWNLOAD_DELAY. You may also try out the autothrottle extension and look into the concurrent requests settings.
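
Concretely, in the project's settings.py that could look something like this (the values are just examples to tune for your targets):

# settings.py
ROBOTSTXT_OBEY = True              # respect robots.txt

DOWNLOAD_DELAY = 1.0               # fixed delay (seconds) between requests to the same site

# or let Scrapy adapt the delay based on how the server responds
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0
AUTOTHROTTLE_MAX_DELAY = 10.0

# concurrency settings mentioned above
CONCURRENT_REQUESTS = 16
CONCURRENT_REQUESTS_PER_DOMAIN = 8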

As for adding a message queue: well, that depends on your use case, really. If you have a really big crawl (hundreds of millions of URLs or more), it might make sense.

But you already get a lot for free with standalone Scrapy, including a disk-based queue when existing memory isn't enough to hold all pending URLs.

And you can configure the backends the scheduler will use for the memory and disk queues and also completely swap the scheduler with your own version.
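
For example, persisting the pending-request queue to disk and choosing the queue backends is just configuration; the class paths below are Scrapy's defaults, and the JOBDIR path is a placeholder:

# settings.py
# persist pending requests to disk so the crawl can be paused and resumed,
# and so the URL frontier doesn't have to fit entirely in memory
JOBDIR = 'crawls/my_crawl-1'

# queue backends used by the scheduler (shown with their default values)
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleLifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.LifoMemoryQueue'

# the scheduler itself can also be replaced with your own implementation
SCHEDULER = 'scrapy.core.scheduler.Scheduler'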

I'd start with Scrapy and a working spider and iterate from that, improving where it's really needed.

I hope this helps.
