How to crawl a limited number of pages from a site using scrapy?


Problem description

I need to crawl a number of sites, and I only want to crawl a certain number of pages from each site. How can I implement this?

My idea is to use a dict whose keys are domain names and whose values are the number of pages already stored in MongoDB. When a page is crawled and stored in the database successfully, the page count for that domain is incremented by one. If the count exceeds the maximum, the spider should stop crawling that site.

Below is my code, but it doesn't work: even when spider.crawledPagesPerSite[domain_name] is greater than spider.maximumPagesPerSite, the spider keeps crawling.

# Imports needed for this snippet (AnExampleItem and parse_page are the
# question author's own item class and HTML-parsing helper).
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class AnExampleSpider(CrawlSpider):
    name = "anexample"
    rules = (
        Rule(LinkExtractor(allow=r"/*.html"), callback="parse_url", follow=True),
    )

    def __init__(self, url_file):  # , N=10, *a, **kw
        # One domain per line in url_file
        data = open(url_file, 'r').readlines()  # [:N]
        self.allowed_domains = [i.strip() for i in data]
        self.start_urls = ['http://' + domain for domain in self.allowed_domains]
        super(AnExampleSpider, self).__init__()  # *a, **kw

        self.maximumPagesPerSite = 100  # maximum pages for each site
        self.crawledPagesPerSite = {}   # domain -> pages stored so far

    def parse_url(self, response):
        url = response.url
        item = AnExampleItem()
        html_text = response.body
        extracted_text = parse_page.parse_page(html_text)
        item["url"] = url
        item["extracted_text"] = extracted_text
        return item

import pymongo
import tldextract

from scrapy import log            # legacy logging API used by this code
from scrapy.conf import settings  # legacy settings access (older Scrapy versions)
from scrapy.exceptions import DropItem


class MongoDBPipeline(object):
    def __init__(self):
        self.connection = pymongo.MongoClient(settings['MONGODB_SERVER'], settings['MONGODB_PORT'])

    def process_item(self, item, spider):
        domain_name = tldextract.extract(item['url']).domain
        db = self.connection[domain_name]  # use the domain name as the database name
        self.collection = db[settings['MONGODB_COLLECTION']]
        valid = True
        for data in item:
            if not data:
                valid = False
                raise DropItem("Missing {0}!".format(data))
            if valid:
                self.collection.insert(dict(item))
                log.msg("Item added to MongoDB database!", level=log.DEBUG, spider=spider)
                # Count the stored page for this domain
                if domain_name in spider.crawledPagesPerSite:
                    spider.crawledPagesPerSite[domain_name] += 1
                else:
                    spider.crawledPagesPerSite[domain_name] = 1
                # Once the limit is reached, try to stop crawling this site
                if spider.crawledPagesPerSite[domain_name] > spider.maximumPagesPerSite:
                    suffix = tldextract.extract(item['url']).suffix
                    domain_and_suffix = domain_name + "." + suffix

                    if domain_and_suffix in spider.allowed_domains:
                        spider.allowed_domains.remove(domain_and_suffix)
                        spider.rules[0].link_extractor.allow_domains.remove(domain_and_suffix)
                        return None
                return item

Recommended answer

I am a beginner in Scrapy myself, but I combined two answers from other StackOverflow posts to find a solution that works for me. Let's say you want to stop scraping after N pages; then you can import the CloseSpider exception like this:

# To import it:
from scrapy.exceptions import CloseSpider


# Later, to use it:
raise CloseSpider('message')

For example, you can integrate it into the parser to close the spider after N URLs:

# Inside your spider class:
N = 10      # Here change 10 to however many pages you want.
count = 0   # The count starts at zero.

def parse(self, response):
    # Stop once N pages have been parsed
    if self.count >= self.N:
        raise CloseSpider(f"Scraped {self.N} items. Eject!")
    # Increment the count by one:
    self.count += 1

    # Put here the rest of the code for parsing
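
The original question asks for a per-site limit rather than a single global one. Below is a minimal sketch of how the same counter idea could be keyed by domain (my own adaptation, not part of the linked answers; it reuses the crawledPagesPerSite / maximumPagesPerSite attributes from the question and Python's standard urllib.parse):

# Hypothetical per-domain variant of the counter above (not from the linked posts).
from urllib.parse import urlparse

class AnExampleSpider(CrawlSpider):
    maximumPagesPerSite = 100  # limit per site, as in the question
    crawledPagesPerSite = {}   # domain -> pages parsed so far

    def parse_url(self, response):
        domain = urlparse(response.url).netloc
        count = self.crawledPagesPerSite.get(domain, 0)
        if count >= self.maximumPagesPerSite:
            return  # skip pages from sites that already reached the limit
        self.crawledPagesPerSite[domain] = count + 1
        # ... build and return the item as in the question ...

Note that this only skips parsing: requests already scheduled for that domain are still downloaded. Dropping the requests themselves (for example via the process_request argument of the CrawlSpider Rule) would avoid the extra downloads.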

Links to the original posts I found:

  1. Force spider to stop crawling
  2. Scrapy: how to limit the number of URLs scraped in SitemapSpider
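
As a side note not mentioned in the posts above: Scrapy also ships a built-in CloseSpider extension, so a global page limit can be set declaratively with the standard CLOSESPIDER_PAGECOUNT setting (it limits the whole crawl, not each site):

# Rough equivalent of the global counter, using a standard Scrapy setting.
class AnExampleSpider(CrawlSpider):
    custom_settings = {
        "CLOSESPIDER_PAGECOUNT": 100,  # close the spider after about 100 responses
    }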

