How to crawl data from the linked webpages on a webpage we are crawling

Question

I am crawling the names of the colleges on this webpage, but I also want to crawl the number of faculties in each college. That number is available on each college's own page, which opens when you click the college's name.

What should I add to this code to get that result? The result should be in the form [(name1, faculty1), (name2, faculty2), ...].

import scrapy
class QuotesSpider(scrapy.Spider):
    name = "student"
    start_urls = [
        'http://www.engineering.careers360.com/colleges/list-of-engineering-colleges-in-karnataka?sort_filter=alpha',
    ]

    def parse(self, response):
        for students in response.css('li.search-result'):
            yield {
                'name': students.css('div.title a::text').extract(),                   
            }

Recommended answer

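import scrapy

# Assumption: the college link is the same <a> element the name is read from.
SELECT_URL = 'div.title a::attr(href)'
# Placeholder: replace with the CSS selector for the data on the college page.
SELECTOR = 'SELECTOR'


class QuotesSpider(scrapy.Spider):
    name = "student"
    start_urls = [
        'http://www.engineering.careers360.com/colleges/list-of-engineering-colleges-in-karnataka?sort_filter=alpha',
    ]

    def parse(self, response):
        for students in response.css('li.search-result'):
            url = students.css(SELECT_URL).extract_first()
            if url is None:
                continue  # no link found for this entry
            # Resolve relative links, then carry the name along in the request's meta.
            req = scrapy.Request(response.urljoin(url), callback=self.parse_student)
            req.meta['name'] = students.css('div.title a::text').extract_first()
            yield req

    def parse_student(self, response):
        yield {
            'name': response.meta.get('name'),
            'other data': response.css(SELECTOR).extract_first(),
        }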

It should be something like this. You send the name (here the college name, stored under 'name') in the meta data of the request. That lets you read it back through response.meta in the callback that handles the linked page.

If the data is also available on the last page you scrape, in parse_student, consider not sending it in the meta data and just scraping it from that page instead.
