How to recursively crawl subpages with Scrapy


Problem description

So basically I am trying to crawl a page with a set of categories, scrape the name of each category, follow the sublink associated with each category to a page with a set of subcategories, scrape their names, and then follow each subcategory to its associated page and retrieve text data. At the end I want to output a JSON file formatted somewhat like:

  1. Category 1 name
    • Subcategory 1 name
      • data from this subcategory's page
    • Subcategory n name
      • data from subcategory n's page

Eventually I want to be able to use this data with ElasticSearch.
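
For example, the JSON file could be nested roughly like this (all names below are placeholders, just to illustrate the shape):

    [
      {
        "category": "Category 1 name",
        "subcategories": [
          {
            "subcategory": "Subcategory 1 name",
            "data": "text data from this subcategory's page"
          },
          {
            "subcategory": "Subcategory n name",
            "data": "text data from subcategory n's page"
          }
        ]
      }
    ]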

I barely have any experience with Scrapy, and this is what I have so far (it just scrapes the category names from the first page; I have no idea what to do from there)... From my research I believe I need to use a CrawlSpider, but I am unsure of what that entails. It has also been suggested that I use BeautifulSoup. Any help would be greatly appreciated.

    import scrapy


    class randomSpider(scrapy.Spider):
        name = "helpme"
        allowed_domains = ["example.com"]
        start_urls = ['http://example.com/categories',]

        def parse(self, response):
            # Scrape the name of each category from the first page.
            for i in response.css('div.CategoryTreeSection'):
                yield {
                    'categories': i.css('a::text').extract_first()
                }

Recommended answer

I'm not familiar with ElasticSearch, but I'd build the scraper like this:

    import scrapy


    class randomSpider(scrapy.Spider):
        name = "helpme"
        allowed_domains = ["example.com"]
        start_urls = ['http://example.com/categories',]

        def parse(self, response):
            for i in response.css('div.CategoryTreeSection'):
                # This is where you select the subcategory URL (as a string)
                subcategory = i.css('Put your selector here').extract_first()
                req = scrapy.Request(subcategory, callback=self.parse_subcategory)
                req.meta['category'] = i.css('a::text').extract_first()
                yield req

        def parse_subcategory(self, response):
            yield {
                'category': response.meta.get('category'),
                'subcategory': response.css('Put your selector here'),      # Select the name of the subcategory
                'subcategorydata': response.css('Put your selector here'),  # Select the data of the subcategory
            }

You collect the subcategory URL and send a request. The response of this request is opened in parse_subcategory. While sending this request, we add the category name to its meta data.

In the parse_subcategory function you get the category name back from the meta data and collect the subcategory data from the webpage.
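
If the subcategory listing and the subcategory's own page are two separate pages, as the question describes, the same meta-passing pattern can be chained one more level. This is only a rough sketch: the selectors are placeholders, and it assumes each category and subcategory entry contains a link whose href can be followed.

    import scrapy


    class randomSpider(scrapy.Spider):
        name = "helpme"
        allowed_domains = ["example.com"]
        start_urls = ['http://example.com/categories',]

        def parse(self, response):
            # Categories page: grab each category name and follow its link.
            for i in response.css('div.CategoryTreeSection'):
                url = response.urljoin(i.css('a::attr(href)').extract_first())
                req = scrapy.Request(url, callback=self.parse_category)
                req.meta['category'] = i.css('a::text').extract_first()
                yield req

        def parse_category(self, response):
            # Subcategories page: grab each subcategory name and follow its link,
            # carrying the category name along in meta.
            for i in response.css('Put your subcategory selector here'):
                url = response.urljoin(i.css('a::attr(href)').extract_first())
                req = scrapy.Request(url, callback=self.parse_subcategory)
                req.meta['category'] = response.meta['category']
                req.meta['subcategory'] = i.css('a::text').extract_first()
                yield req

        def parse_subcategory(self, response):
            # Final page: emit one item per subcategory page.
            yield {
                'category': response.meta['category'],
                'subcategory': response.meta['subcategory'],
                'subcategorydata': response.css('Put your data selector here').extract(),
            }

Running it with scrapy runspider helpme.py -o output.json writes the yielded items to a JSON file, which can then be indexed into ElasticSearch.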
