Multiple nested requests with scrapy

Problem description

I am trying to scrape some airplane schedule information from the www.flightradar24.com website for a research project.

The hierarchy of the JSON file I want to obtain looks like this:

Object ID
 - country
   - link
   - name
   - airports
     - airport0
       - code_total
       - link
       - lat
       - lon
       - name
       - schedule
         - ...
         - ...
     - airport1
       - code_total
       - link
       - lat
       - lon
       - name
       - schedule
         - ...
         - ...
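
Concretely, the exported item for a single country would look something like this (all values invented, purely to illustrate the nesting):

# Hypothetical target output for one country (invented values):
{
    "country": {
        "name": "France",
        "link": "https://www.flightradar24.com/data/airports/france",
        "airports": [
            {
                "code_total": "CDG/LFPG",
                "link": "https://www.flightradar24.com/data/airports/cdg",
                "lat": "49.012798",
                "lon": "2.55",
                "name": "Paris Charles de Gaulle Airport",
                "schedule": {"arrivals": "...", "departures": "..."}
            },
            # ... airport1, airport2, ... with the same keys
        ]
    }
}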

Country and Airport are stored using Items; as you can see in the JSON hierarchy, a CountryItem (link, name attributes) ultimately stores multiple AirportItems (code_total, link, lat, lon, name, schedule):

import scrapy


class CountryItem(scrapy.Item):
    name = scrapy.Field()
    link = scrapy.Field()
    airports = scrapy.Field()   # holds the list of airports for this country
    other_url = scrapy.Field()
    last_updated = scrapy.Field(serializer=str)


class AirportItem(scrapy.Item):
    name = scrapy.Field()
    code_little = scrapy.Field()
    code_total = scrapy.Field()
    lat = scrapy.Field()
    lon = scrapy.Field()
    link = scrapy.Field()
    schedule = scrapy.Field()

Here is my spider code, AirportsSpider, that does this:

import json

import jmespath
import scrapy
from bs4 import BeautifulSoup
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
from scrapy.spiders import Rule

# CountryItem and AirportItem are the item definitions shown above


class AirportsSpider(scrapy.Spider):
    name = "airports"
    start_urls = ['https://www.flightradar24.com/data/airports']
    allowed_domains = ['flightradar24.com']

    def clean_html(self, html_text):
        soup = BeautifulSoup(html_text, 'html.parser')
        return soup.get_text()

    # Note: rules are only honoured by CrawlSpider; a plain scrapy.Spider
    # ignores this attribute.
    rules = [
        # Extract links matching 'data/airports/' and parse them with parse()
        Rule(LxmlLinkExtractor(allow=('data/airports/',)), callback='parse')
    ]

    def parse(self, response):
        count_country = 0
        countries = []
        for country in response.xpath('//a[@data-country]'):
            if count_country > 5:   # limit to the first few countries while testing
                break
            item = CountryItem()
            url = country.xpath('./@href').extract()
            name = country.xpath('./@title').extract()
            item['link'] = url[0]
            item['name'] = name[0]
            count_country += 1
            countries.append(item)
            yield scrapy.Request(url[0], meta={'my_country_item': item},
                                 callback=self.parse_airports)

    def parse_airports(self, response):
        item = response.meta['my_country_item']
        airports = []

        for airport in response.xpath('//a[@data-iata]'):
            url = airport.xpath('./@href').extract()
            iata = airport.xpath('./@data-iata').extract()
            iatabis = airport.xpath('./small/text()').extract()
            name = ''.join(airport.xpath('./text()').extract()).strip()
            lat = airport.xpath("./@data-lat").extract()
            lon = airport.xpath("./@data-lon").extract()

            iAirport = AirportItem()
            iAirport['name'] = self.clean_html(name)
            iAirport['link'] = url[0]
            iAirport['lat'] = lat[0]
            iAirport['lon'] = lon[0]
            iAirport['code_little'] = iata[0]
            iAirport['code_total'] = iatabis[0]

            airports.append(iAirport)

        for airport in airports:
            json_url = 'https://api.flightradar24.com/common/v1/airport.json?code={code}&plugin[]=&plugin-setting[schedule][mode]=&plugin-setting[schedule][timestamp]={timestamp}&page=1&limit=50&token='.format(code=airport['code_little'], timestamp="1484150483")
            yield scrapy.Request(json_url, meta={'airport_item': airport},
                                 callback=self.parse_schedule)

        item['airports'] = airports

        yield {"country": item}

    def parse_schedule(self, response):
        item = response.request.meta['airport_item']
        jsonload = json.loads(response.body_as_unicode())
        json_expression = jmespath.compile("result.response.airport.pluginData.schedule")
        item['schedule'] = json_expression.search(jsonload)
        # nothing is yielded here

Explanation:

  • In my first parse, I issue a request for each country link I find, with the CountryItem passed along via meta={'my_country_item':item}. Each of these requests calls back self.parse_airports (see the distilled sketch after this list).

  • In my second level of parsing, parse_airports, I retrieve the CountryItem with item = response.meta['my_country_item'], and I create a new item iAirport = AirportItem() for each airport I find on that country page. Now I want to fetch the schedule information for each AirportItem created and stored in the airports list.

  • Still in parse_airports, I run a for loop over airports to fetch the schedule information with a new Request. Because I want to include this schedule information in my AirportItem, I pass the item along in meta={'airport_item': airport}. The callback of this request runs parse_schedule.

  • In the third level of parsing, parse_schedule, I inject the schedule information collected by scrapy into the AirportItem previously created, using response.request.meta['airport_item'].
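
Distilled to just the hand-off mechanics, the pattern I am using across the three callbacks is this (a minimal, self-contained sketch with hypothetical URLs, not my real spider):

import scrapy

class HandoffSketchSpider(scrapy.Spider):
    """Minimal sketch of passing one item through chained callbacks via meta."""
    name = "handoff_sketch"
    start_urls = ["https://example.com/countries"]  # hypothetical URL

    def parse(self, response):
        item = {"name": "France", "airports": []}   # stands in for CountryItem
        yield scrapy.Request("https://example.com/france",  # hypothetical URL
                             meta={"my_country_item": item},
                             callback=self.parse_airports)

    def parse_airports(self, response):
        item = response.meta["my_country_item"]     # the same object created in parse()
        airport = {"name": "CDG"}                   # stands in for AirportItem
        item["airports"].append(airport)
        yield scrapy.Request("https://example.com/cdg.json",  # hypothetical URL
                             meta={"airport_item": airport},
                             callback=self.parse_schedule)

    def parse_schedule(self, response):
        airport = response.meta["airport_item"]     # the same object appended above
        airport["schedule"] = response.text         # mutated in place; nothing yielded yet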

But I have a problem in my source code: scrapy correctly scrapes all the information (country, airports, schedule), but my understanding of nested items seems to be wrong. As you can see, the JSON I produce contains country > list of (airport), but not country > list of (airport > schedule).
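
In other words, what comes out for each country currently looks roughly like this (invented values again), with no schedule key nested under the airports:

# Current (incorrect) output shape -- no "schedule" nested under each airport:
{
    "country": {
        "name": "France",
        "link": "https://www.flightradar24.com/data/airports/france",
        "airports": [
            {
                "code_total": "CDG/LFPG",
                "link": "https://www.flightradar24.com/data/airports/cdg",
                "lat": "49.012798",
                "lon": "2.55",
                "name": "Paris Charles de Gaulle Airport"
                # "schedule" is missing here
            }
        ]
    }
}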

My code is on GitHub: https://github.com/IDEES-Rouen/Flight-Scrapping

Answer

The issue is that you fork your item: according to your logic you only want one item per country, so you cannot yield multiple items at any point after parsing the country. What you want to do is stack all of them into one item.
To do that you need to create a parsing loop:

import json

from scrapy import Request

# drop-in replacements for the two methods on AirportsSpider

def parse_airports(self, response):
    item = response.meta['my_country_item']
    item['airports'] = []

    for airport in response.xpath('//a[@data-iata]'):
        url = airport.xpath('./@href').extract()
        iata = airport.xpath('./@data-iata').extract()
        iatabis = airport.xpath('./small/text()').extract()
        name = ''.join(airport.xpath('./text()').extract()).strip()
        lat = airport.xpath("./@data-lat").extract()
        lon = airport.xpath("./@data-lon").extract()

        iAirport = dict()
        iAirport['name'] = 'foobar'  # placeholder -- use self.clean_html(name) as in your spider
        iAirport['link'] = url[0]
        iAirport['lat'] = lat[0]
        iAirport['lon'] = lon[0]
        iAirport['code_little'] = iata[0]
        iAirport['code_total'] = iatabis[0]
        item['airports'].append(iAirport)

    urls = []
    for airport in item['airports']:
        json_url = 'https://api.flightradar24.com/common/v1/airport.json?code={code}&plugin[]=&plugin-setting[schedule][mode]=&plugin-setting[schedule][timestamp]={timestamp}&page=1&limit=50&token='.format(
            code=airport['code_little'], timestamp="1484150483")
        urls.append(json_url)
    if not urls:
        return item

    # start with the first url; pop(0) keeps the url order aligned with the index i
    next_url = urls.pop(0)
    return Request(next_url, self.parse_schedule,
                   meta={'airport_item': item, 'airport_urls': urls, 'i': 0})

def parse_schedule(self, response):
    """we want to loop this continuously for every schedule item"""
    item = response.meta['airport_item']
    i = response.meta['i']
    urls = response.meta['airport_urls']

    jsonload = json.loads(response.body_as_unicode())
    # placeholder -- extract the real schedule from jsonload, e.g. with the
    # jmespath expression "result.response.airport.pluginData.schedule"
    item['airports'][i]['schedule'] = 'foobar'
    # now do the next schedule items
    if not urls:
        yield item
        return
    url = urls.pop(0)
    yield Request(url, self.parse_schedule,
                  meta={'airport_item': item, 'airport_urls': urls, 'i': i + 1})
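
This works because the requests are chained rather than yielded in parallel: every response carries the whole country item plus the list of remaining schedule URLs in its meta, so only one request per country is in flight at any time, the item is never forked, and it is yielded exactly once, after the last schedule response has been folded in. The trade-off is that the schedule requests for a given country are fetched sequentially instead of concurrently.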
