Extract text from 200k domains with scrapy

Problem Description

My problem is: I want to extract all valuable text from some domain, for example www.example.com. So I go to this website, visit all the links up to a maximal depth of 2, and write the result to a CSV file.

I wrote a module in Scrapy which solves this problem using one process and yielding multiple crawlers, but it is inefficient: I am able to crawl about 1k domains / ~5k websites per hour, and as far as I can see my bottleneck is the CPU (because of the GIL?). After leaving my PC alone for some time I also found that my network connection had broken.

When I wanted to use multiple processes I just got an error from Twisted: Multiprocessing of Scrapy Spiders in Parallel Processes. So this means I have to learn Twisted, which I would say is deprecated compared to asyncio, but that is only my opinion.

So I have a couple of ideas about what to do:

  • Fight back, try to learn Twisted and implement multiprocessing and a distributed queue with Redis, but I don't feel that Scrapy is the right tool for this type of job.
  • Use pyspider - it has all the features I need (I have never used it).
  • Use nutch - it is so complex (I have never used it).
  • Try to build my own distributed crawler, but after crawling 4 websites I have already found 4 edge cases: SSL, duplications, timeouts. But it would be easy to add some modifications, for example focused crawling (see the sketch after this list).
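
For illustration only, here is a minimal sketch of that last idea, assuming an asyncio-based fetcher built on aiohttp (the file layout and helper names are hypothetical, not the asker's actual crawler); it only shows how the mentioned edge cases - broken SSL, duplicates and timeouts - could be handled:

import asyncio

import aiohttp


async def fetch(session, url, seen):
    if url in seen:                      # skip duplicates
        return None
    seen.add(url)
    try:
        async with session.get(url,
                               timeout=aiohttp.ClientTimeout(total=10),  # per-request timeout
                               ssl=False) as resp:                       # ignore broken certificates
            return await resp.text()
    except (aiohttp.ClientError, asyncio.TimeoutError):
        return None                      # SSL/connection/timeout errors are simply dropped


async def crawl_all(urls):
    seen = set()
    async with aiohttp.ClientSession() as session:
        pages = await asyncio.gather(*(fetch(session, u, seen) for u in urls))
    return [p for p in pages if p]


# asyncio.run(crawl_all(['https://www.example.com']))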

What solution would you recommend?

Edit1: shared code

import html2text

# OrderedSet, regex (a pre-compiled pattern), langdetect and comprehension_helper
# are imported/defined elsewhere in the project.


class ESIndexingPipeline(object):
    def __init__(self):
        self.extracted_type = []
        self.text = OrderedSet()
        # configure html2text to keep only the plain text
        self.h = html2text.HTML2Text()
        self.h.ignore_links = True
        self.h.images_to_alt = True

    def process_item(self, item, spider):
        body = item['body']
        # convert the HTML body to plain text and split it into lines
        body = self.h.handle(str(body, 'utf8')).split('\n')

        # glue consecutive non-empty lines back together into paragraphs
        first_line = True
        for piece in body:
            piece = piece.strip(' \n\t\r')
            if len(piece) == 0:
                first_line = True
            else:
                e = ''
                if not self.text.empty() and not first_line and not regex.match(piece):
                    # continue the previous paragraph
                    e = self.text.pop() + ' '
                e += piece
                self.text.add(e)
                first_line = False

        return item

    def open_spider(self, spider):
        self.target_id = spider.target_id
        self.queue = spider.queue

    def close_spider(self, spider):
        # keep only the paragraphs detected as English
        self.text = [e for e in self.text if comprehension_helper(langdetect.detect, e) == 'en']
        if spider.write_to_file:
            self._write_to_file(spider)

    def _write_to_file(self, spider):
        concat = "\n".join(self.text)
        self.queue.put([self.target_id, concat])

And the call:

from multiprocessing import Process

from scrapy.crawler import CrawlerProcess
from twisted.internet import defer, reactor

# DEFAULT_SPIDER_SETTINGS and TextExtractionSpider are defined elsewhere in the project.


def execute_crawler_process(targets, write_to_file=True, settings=None, parallel=800, queue=None):
    if settings is None:
        settings = DEFAULT_SPIDER_SETTINGS

    # note: yielding runner.join() below makes the batches run sequentially
    @defer.inlineCallbacks
    def crawl(runner):
        n_crawlers_batch = 0
        done = 0
        n = float(len(targets))
        for url in targets:
            #print("target: ", url)
            n_crawlers_batch += 1
            r = runner.crawl(
                TextExtractionSpider,
                url=url,
                target_id=url,
                write_to_file=write_to_file,
                queue=queue)
            if n_crawlers_batch == parallel:
                print('joining')
                done += n_crawlers_batch  # count the batch before resetting the counter
                n_crawlers_batch = 0
                d = runner.join()
                print('done %d/%d' % (done, n))
                yield d  # wait until the whole batch has finished downloading
        if n_crawlers_batch > 0:
            # join whatever is left over in the last, smaller batch
            d = runner.join()
            done += n_crawlers_batch
            yield d

        reactor.stop()

    def f():
        runner = CrawlerProcess(settings)
        crawl(runner)
        reactor.run()

    # run everything in a child process so that each call gets a fresh Twisted reactor
    p = Process(target=f)
    p.start()

The spider is not particularly interesting.

Recommended Answer

You can use Scrapy-Redis. It is basically a Scrapy spider that fetches the URLs to crawl from a queue in Redis. The advantage is that you can start many concurrent spiders, so you can crawl faster. All the instances of the spider will pull URLs from the queue and wait idle when they run out of URLs to crawl. The repository of Scrapy-Redis comes with an example project that implements this.
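
A minimal sketch of what such a spider could look like, loosely based on the example project that ships with Scrapy-Redis; the spider name, the Redis key and the depth limit are assumptions, not part of the original answer:

from scrapy_redis.spiders import RedisSpider


class TextSpider(RedisSpider):
    name = 'text_spider'
    # every instance pops its start URLs from this Redis list instead of start_urls
    redis_key = 'text_spider:start_urls'

    custom_settings = {
        # scheduler and duplicate filter shared through Redis
        'SCHEDULER': 'scrapy_redis.scheduler.Scheduler',
        'DUPEFILTER_CLASS': 'scrapy_redis.dupefilter.RFPDupeFilter',
        'REDIS_URL': 'redis://localhost:6379',
        'DEPTH_LIMIT': 2,  # the "maximal depth 2" requirement from the question
    }

    def parse(self, response):
        yield {'url': response.url, 'body': response.text}
        # follow links; DEPTH_LIMIT stops the crawl at depth 2
        for href in response.css('a::attr(href)').getall():
            yield response.follow(href, callback=self.parse)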

I used Scrapy-Redis to fire up 64 instances of my crawler and scraped 1 million URLs in around 1 hour.
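
As a usage example (my own sketch, not part of the original answer), the shared queue can be seeded with redis-py and then any number of worker processes started against it; the key name matches the hypothetical redis_key above and domains.txt is an assumed input file:

import redis

r = redis.Redis(host='localhost', port=6379)
with open('domains.txt') as f:               # one domain per line (assumed)
    for line in f:
        domain = line.strip()
        if domain:
            r.lpush('text_spider:start_urls', 'http://' + domain)

# then start as many workers as the machine can handle, e.g. run
#   scrapy crawl text_spider
# in several shells; each process pops URLs from the same list and the
# shared RFPDupeFilter keeps them from crawling the same page twice.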
