Extract text from 200k domains with scrapy

Problem Description

My problem is: I want to extract all valuable text from some domain, for example www.example.com. So I go to this website, visit all the links up to a maximal depth of 2, and write the result to a CSV file.

I wrote a module in Scrapy which solves this problem using one process and yielding multiple crawlers, but it is inefficient: I am able to crawl about 1k domains / ~5k websites per hour, and as far as I can see my bottleneck is the CPU (because of the GIL?). After leaving my PC alone for some time I also found that my network connection had broken.

When I wanted to use multiple processes I just got an error from Twisted: Multiprocessing of Scrapy Spiders in Parallel Processes. So this means I have to learn Twisted, which I would say is deprecated compared to asyncio, but that is only my opinion.

So I have a couple of ideas about what to do:

  • Fight back, try to learn Twisted and implement multiprocessing and a distributed queue with Redis, but I don't feel that Scrapy is the right tool for this type of job.
  • Use pyspider - it has all the features I need (I have never used it).
  • Use nutch - it is so complex (I have never used it).
  • Try to build my own distributed crawler, but after crawling 4 websites I have already found 4 edge cases: SSL, duplications, timeouts. But it would be easy to add some modifications, for example focused crawling (see the sketch after this list).
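
For illustration only, here is a minimal sketch of that last idea, assuming an asyncio-based fetcher built on aiohttp (the file layout and helper names are hypothetical, not the asker's actual crawler); it only shows how the mentioned edge cases - broken SSL, duplicates and timeouts - could be handled:

import asyncio

import aiohttp


async def fetch(session, url, seen):
    if url in seen:                      # skip duplicates
        return None
    seen.add(url)
    try:
        async with session.get(url,
                               timeout=aiohttp.ClientTimeout(total=10),  # per-request timeout
                               ssl=False) as resp:                       # ignore broken certificates
            return await resp.text()
    except (aiohttp.ClientError, asyncio.TimeoutError):
        return None                      # SSL/connection/timeout errors are simply dropped


async def crawl_all(urls):
    seen = set()
    async with aiohttp.ClientSession() as session:
        pages = await asyncio.gather(*(fetch(session, u, seen) for u in urls))
    return [p for p in pages if p]


# asyncio.run(crawl_all(['https://www.example.com']))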

What solution would you recommend?

Edit1: shared code

import html2text

# OrderedSet, regex (a pre-compiled pattern), langdetect and comprehension_helper
# are imported/defined elsewhere in the project.


class ESIndexingPipeline(object):
    def __init__(self):
        self.extracted_type = []
        self.text = OrderedSet()
        # configure html2text to keep only the plain text
        self.h = html2text.HTML2Text()
        self.h.ignore_links = True
        self.h.images_to_alt = True

    def process_item(self, item, spider):
        body = item['body']
        # convert the HTML body to plain text and split it into lines
        body = self.h.handle(str(body, 'utf8')).split('\n')

        # glue consecutive non-empty lines back together into paragraphs
        first_line = True
        for piece in body:
            piece = piece.strip(' \n\t\r')
            if len(piece) == 0:
                first_line = True
            else:
                e = ''
                if not self.text.empty() and not first_line and not regex.match(piece):
                    # continue the previous paragraph
                    e = self.text.pop() + ' '
                e += piece
                self.text.add(e)
                first_line = False

        return item

    def open_spider(self, spider):
        self.target_id = spider.target_id
        self.queue = spider.queue

    def close_spider(self, spider):
        # keep only the paragraphs detected as English
        self.text = [e for e in self.text if comprehension_helper(langdetect.detect, e) == 'en']
        if spider.write_to_file:
            self._write_to_file(spider)

    def _write_to_file(self, spider):
        concat = "\n".join(self.text)
        self.queue.put([self.target_id, concat])

And the call:

from multiprocessing import Process

from scrapy.crawler import CrawlerProcess
from twisted.internet import defer, reactor

# DEFAULT_SPIDER_SETTINGS and TextExtractionSpider are defined elsewhere in the project.


def execute_crawler_process(targets, write_to_file=True, settings=None, parallel=800, queue=None):
    if settings is None:
        settings = DEFAULT_SPIDER_SETTINGS

    # note: yielding runner.join() below makes the batches run sequentially
    @defer.inlineCallbacks
    def crawl(runner):
        n_crawlers_batch = 0
        done = 0
        n = float(len(targets))
        for url in targets:
            #print("target: ", url)
            n_crawlers_batch += 1
            r = runner.crawl(
                TextExtractionSpider,
                url=url,
                target_id=url,
                write_to_file=write_to_file,
                queue=queue)
            if n_crawlers_batch == parallel:
                print('joining')
                done += n_crawlers_batch  # count the batch before resetting the counter
                n_crawlers_batch = 0
                d = runner.join()
                print('done %d/%d' % (done, n))
                yield d  # wait until the whole batch has finished downloading
        if n_crawlers_batch > 0:
            # join whatever is left over in the last, smaller batch
            d = runner.join()
            done += n_crawlers_batch
            yield d

        reactor.stop()

    def f():
        runner = CrawlerProcess(settings)
        crawl(runner)
        reactor.run()

    # run everything in a child process so that each call gets a fresh Twisted reactor
    p = Process(target=f)
    p.start()

The spider is not particularly interesting.

Recommended Answer

You can use Scrapy-Redis. It is basically a Scrapy spider that fetches the URLs to crawl from a queue in Redis. The advantage is that you can start many concurrent spiders, so you can crawl faster. All the instances of the spider will pull URLs from the queue and wait idle when they run out of URLs to crawl. The repository of Scrapy-Redis comes with an example project that implements this.
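
A minimal sketch of what such a spider could look like, loosely based on the example project that ships with Scrapy-Redis; the spider name, the Redis key and the depth limit are assumptions, not part of the original answer:

from scrapy_redis.spiders import RedisSpider


class TextSpider(RedisSpider):
    name = 'text_spider'
    # every instance pops its start URLs from this Redis list instead of start_urls
    redis_key = 'text_spider:start_urls'

    custom_settings = {
        # scheduler and duplicate filter shared through Redis
        'SCHEDULER': 'scrapy_redis.scheduler.Scheduler',
        'DUPEFILTER_CLASS': 'scrapy_redis.dupefilter.RFPDupeFilter',
        'REDIS_URL': 'redis://localhost:6379',
        'DEPTH_LIMIT': 2,  # the "maximal depth 2" requirement from the question
    }

    def parse(self, response):
        yield {'url': response.url, 'body': response.text}
        # follow links; DEPTH_LIMIT stops the crawl at depth 2
        for href in response.css('a::attr(href)').getall():
            yield response.follow(href, callback=self.parse)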

I used Scrapy-Redis to fire up 64 instances of my crawler and scraped 1 million URLs in around 1 hour.
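
As a usage example (my own sketch, not part of the original answer), the shared queue can be seeded with redis-py and then any number of worker processes started against it; the key name matches the hypothetical redis_key above and domains.txt is an assumed input file:

import redis

r = redis.Redis(host='localhost', port=6379)
with open('domains.txt') as f:               # one domain per line (assumed)
    for line in f:
        domain = line.strip()
        if domain:
            r.lpush('text_spider:start_urls', 'http://' + domain)

# then start as many workers as the machine can handle, e.g. run
#   scrapy crawl text_spider
# in several shells; each process pops URLs from the same list and the
# shared RFPDupeFilter keeps them from crawling the same page twice.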
