Scrapy crawler - creating 10,000 spiders or one spider crawling 10,000 domains?


Question

I need to crawl up to 10,000 websites.

Since every website is unique, with its own HTML structure, and requires its own XPath logic for creating and delegating Request objects, I'm tempted to create a unique spider for each website.

But is this the best way forward? Should I instead have a single spider, add all 10,000 websites to start_urls and allowed_domains, write scraping libraries, and go for it?

What is the best practice in this regard?
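
For reference, here is a minimal sketch of the single-spider variant the question describes. The domains are placeholders (the real lists would hold up to 10,000 entries), and the point it illustrates is that one parse callback would have to branch on the domain to apply each site's own XPath logic:

import scrapy

class AllSitesSpider(scrapy.Spider):
    name = "all_sites"
    # Placeholder domains; the real lists would hold up to 10,000 entries.
    allowed_domains = ["site-one.example", "site-two.example"]
    start_urls = ["https://site-one.example/", "https://site-two.example/"]

    def parse(self, response):
        # A single callback has to dispatch on response.url to apply
        # each site's own XPath logic, which gets unwieldy fast.
        ...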

Answer

I faced a similar problem, and I took a middle road.

Much of the data you will encounter will (likely) be handled the same way when you finally process it. That means much of the logic you need can be reused. The specifics include where to look for data and how to transform it into a common format. I suggest the following:

Create your MainSpider class, containing most of the logic and tasks that you need. For each site, subclass MainSpider and define logic modules as required.

main_spider.py

class MainSpider(object):
    # Logic shared by every site: fetching pages, queueing requests, etc.
    def get_links(self, url):
        # Collect the links to follow from the given page.
        links = []
        return links

spider_mysite.py

from main_spider import MainSpider

class SpiderMysite(MainSpider):
    def get_data(self, links):
        for link in links:
            # Site-specific extraction for each link goes here.
            pass
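
As a rough sketch of how the same pattern might look with Scrapy's own base class (the names, domains, and selectors here are hypothetical, not from the answer above):

import scrapy

class MainSpider(scrapy.Spider):
    # Shared helpers that every site-specific spider inherits.
    def extract_common_fields(self, response):
        # Normalise the fields all sites share into one common format.
        return {"url": response.url, "title": response.css("title::text").get()}

class MysiteSpider(MainSpider):
    name = "mysite"
    allowed_domains = ["mysite.example"]
    start_urls = ["https://mysite.example/"]

    def parse(self, response):
        item = self.extract_common_fields(response)
        # Only the site-specific XPath logic lives in the subclass.
        item["headline"] = response.xpath("//h1/text()").get()
        yield item

This keeps the per-site subclasses small: each one supplies its own start_urls, allowed_domains, and XPath expressions, while everything reusable stays in MainSpider.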

Hope it helps.
