How to recursively crawl subpages with Scrapy


Problem description

So basically I am trying to crawl a page with a set of categories, scrape the name of each category, follow the sublink associated with each category to a page with a set of subcategories, scrape their names, and then follow each subcategory to its associated page and retrieve text data. At the end I want to output a JSON file formatted somewhat like:

  1. Category 1 name
    • Subcategory 1 name
      • data from this subcategory's page
    • Subcategory n name
      • data from subcategory n's page

Eventually I want to be able to use this data with ElasticSearch.
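
For example, the JSON file could be nested roughly like this (all names below are placeholders, just to illustrate the shape):

    [
      {
        "category": "Category 1 name",
        "subcategories": [
          {
            "subcategory": "Subcategory 1 name",
            "data": "text data from this subcategory's page"
          },
          {
            "subcategory": "Subcategory n name",
            "data": "text data from subcategory n's page"
          }
        ]
      }
    ]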

I barely have any experience with Scrapy, and this is what I have so far (it just scrapes the category names from the first page; I have no idea what to do from there)... From my research I believe I need to use a CrawlSpider, but I am unsure of what that entails. It has also been suggested that I use BeautifulSoup. Any help would be greatly appreciated.

    import scrapy


    class randomSpider(scrapy.Spider):
        name = "helpme"
        allowed_domains = ["example.com"]
        start_urls = ['http://example.com/categories',]

        def parse(self, response):
            # Scrape the name of each category from the first page.
            for i in response.css('div.CategoryTreeSection'):
                yield {
                    'categories': i.css('a::text').extract_first()
                }

Recommended answer

I'm not familiar with ElasticSearch, but I'd build the scraper like this:

    import scrapy


    class randomSpider(scrapy.Spider):
        name = "helpme"
        allowed_domains = ["example.com"]
        start_urls = ['http://example.com/categories',]

        def parse(self, response):
            for i in response.css('div.CategoryTreeSection'):
                # This is where you select the subcategory URL (as a string)
                subcategory = i.css('Put your selector here').extract_first()
                req = scrapy.Request(subcategory, callback=self.parse_subcategory)
                req.meta['category'] = i.css('a::text').extract_first()
                yield req

        def parse_subcategory(self, response):
            yield {
                'category': response.meta.get('category'),
                'subcategory': response.css('Put your selector here'),      # Select the name of the subcategory
                'subcategorydata': response.css('Put your selector here'),  # Select the data of the subcategory
            }

You collect the subcategory URL and send a request. The response of this request is opened in parse_subcategory. While sending this request, we add the category name to its meta data.

In the parse_subcategory function you get the category name back from the meta data and collect the subcategory data from the webpage.
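
If the subcategory listing and the subcategory's own page are two separate pages, as the question describes, the same meta-passing pattern can be chained one more level. This is only a rough sketch: the selectors are placeholders, and it assumes each category and subcategory entry contains a link whose href can be followed.

    import scrapy


    class randomSpider(scrapy.Spider):
        name = "helpme"
        allowed_domains = ["example.com"]
        start_urls = ['http://example.com/categories',]

        def parse(self, response):
            # Categories page: grab each category name and follow its link.
            for i in response.css('div.CategoryTreeSection'):
                url = response.urljoin(i.css('a::attr(href)').extract_first())
                req = scrapy.Request(url, callback=self.parse_category)
                req.meta['category'] = i.css('a::text').extract_first()
                yield req

        def parse_category(self, response):
            # Subcategories page: grab each subcategory name and follow its link,
            # carrying the category name along in meta.
            for i in response.css('Put your subcategory selector here'):
                url = response.urljoin(i.css('a::attr(href)').extract_first())
                req = scrapy.Request(url, callback=self.parse_subcategory)
                req.meta['category'] = response.meta['category']
                req.meta['subcategory'] = i.css('a::text').extract_first()
                yield req

        def parse_subcategory(self, response):
            # Final page: emit one item per subcategory page.
            yield {
                'category': response.meta['category'],
                'subcategory': response.meta['subcategory'],
                'subcategorydata': response.css('Put your data selector here').extract(),
            }

Running it with scrapy runspider helpme.py -o output.json writes the yielded items to a JSON file, which can then be indexed into ElasticSearch.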
