Multiple nested requests with scrapy


Question

I am trying to scrape some airplane schedule information from the www.flightradar24.com website for a research project.

The hierarchy of the JSON file I want to obtain is something like this:

Object ID
 - country
   - link
   - name
   - airports
     - airport0 
       - code_total
       - link
       - lat
       - lon
       - name
       - schedule
          - ...
          - ...
     - airport1 
       - code_total
       - link
       - lat
       - lon
       - name
       - schedule
          - ...
          - ...

Country and Airport are stored using items, and as you can see in the JSON file, the CountryItem (link and name attributes) ultimately stores multiple AirportItems (code_total, link, lat, lon, name, schedule):

import scrapy

class CountryItem(scrapy.Item):
    name = scrapy.Field()
    link = scrapy.Field()
    airports = scrapy.Field()
    other_url= scrapy.Field()
    last_updated = scrapy.Field(serializer=str)

class AirportItem(scrapy.Item):
    name = scrapy.Field()
    code_little = scrapy.Field()
    code_total = scrapy.Field()
    lat = scrapy.Field()
    lon = scrapy.Field()
    link = scrapy.Field()
    schedule = scrapy.Field()
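
As a quick illustration of the nesting I am after, the CountryItem's airports field is meant to hold a list of AirportItems (the values below are made up):

country = CountryItem(name='France', link='https://www.flightradar24.com/data/airports/france')
country['airports'] = [
    AirportItem(name='Paris Charles de Gaulle Airport', code_little='CDG',
                lat='49.012516', lon='2.555752'),
]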

Here is my scrapy code for the AirportsSpider:

import json

import jmespath
import scrapy
from bs4 import BeautifulSoup
from scrapy.spiders import Rule
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor

# CountryItem and AirportItem are the item classes defined above

class AirportsSpider(scrapy.Spider):
    name = "airports"
    start_urls = ['https://www.flightradar24.com/data/airports']
    allowed_domains = ['flightradar24.com']

    def clean_html(self, html_text):
        soup = BeautifulSoup(html_text, 'html.parser')
        return soup.get_text()

    # NB: 'rules' are only honoured by CrawlSpider subclasses; a plain
    # scrapy.Spider ignores this attribute.
    rules = [
        # Extract links matching 'data/airports/' and parse them with parse
        Rule(LxmlLinkExtractor(allow=('data/airports/',)), callback='parse')
    ]


    def parse(self, response):
        count_country = 0
        countries = []
        for country in response.xpath('//a[@data-country]'):
            if count_country > 5:
                break
            item = CountryItem()
            url =  country.xpath('./@href').extract()
            name = country.xpath('./@title').extract()
            item['link'] = url[0]
            item['name'] = name[0]
            count_country += 1
            countries.append(item)
            yield scrapy.Request(url[0],meta={'my_country_item':item}, callback=self.parse_airports)

    def parse_airports(self,response):
        item = response.meta['my_country_item']
        airports = []

        for airport in response.xpath('//a[@data-iata]'):
            url = airport.xpath('./@href').extract()
            iata = airport.xpath('./@data-iata').extract()
            iatabis = airport.xpath('./small/text()').extract()
            name = ''.join(airport.xpath('./text()').extract()).strip()
            lat = airport.xpath("./@data-lat").extract()
            lon = airport.xpath("./@data-lon").extract()

            iAirport = AirportItem()
            iAirport['name'] = self.clean_html(name)
            iAirport['link'] = url[0]
            iAirport['lat'] = lat[0]
            iAirport['lon'] = lon[0]
            iAirport['code_little'] = iata[0]
            iAirport['code_total'] = iatabis[0]

            airports.append(iAirport)

        for airport in airports:
            json_url = 'https://api.flightradar24.com/common/v1/airport.json?code={code}&plugin\[\]=&plugin-setting\[schedule\]\[mode\]=&plugin-setting\[schedule\]\[timestamp\]={timestamp}&page=1&limit=50&token='.format(code=airport['code_little'], timestamp="1484150483")
            yield scrapy.Request(json_url, meta={'airport_item': airport}, callback=self.parse_schedule)

        item['airports'] = airports

        yield {"country" : item}

    def parse_schedule(self,response):

        item = response.request.meta['airport_item']
        jsonload = json.loads(response.body_as_unicode())
        json_expression = jmespath.compile("result.response.airport.pluginData.schedule")
        item['schedule'] = json_expression.search(jsonload)

Explanation:

  • In my first parse, I issue a request for each country link I find, with the CountryItem I created attached via meta={'my_country_item':item}. Each of these requests calls back to self.parse_airports.

  • In the second level of parsing, parse_airports, I retrieve the CountryItem with item = response.meta['my_country_item'] and create a new item, iAirport = AirportItem(), for each airport I find on this country page. Now I want to get the schedule information for each AirportItem created and stored in the airports list.

  • Still in parse_airports, I run a for loop over airports to fetch the schedule information with a new Request. Because I want to include this schedule information in my AirportItem, I pass the item along in the meta information, meta={'airport_item': airport}. The callback of this request runs parse_schedule.

  • In the third level of parsing, parse_schedule, I inject the schedule information collected by scrapy into the AirportItem previously created, using response.request.meta['airport_item'].
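
To make the extraction step concrete, here is a minimal, self-contained sketch of what the compiled jmespath expression returns. The sample payload below is illustrative, not the real API output; it only assumes the schedule sits under result.response.airport.pluginData, as the expression implies:

import jmespath

# illustrative payload mirroring the path in the expression
sample = {
    "result": {
        "response": {
            "airport": {
                "pluginData": {
                    "schedule": {"arrivals": [], "departures": []}
                }
            }
        }
    }
}
expression = jmespath.compile("result.response.airport.pluginData.schedule")
print(expression.search(sample))
# -> {'arrivals': [], 'departures': []}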

But I have a problem in my source code: scrapy correctly scrapes all the information (country, airports, schedule), but my understanding of nested items seems to be wrong. As you can see, the JSON I produce contains country > list of (airport), but not country > list of (airport > schedule).
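
In other words, the output currently looks roughly like the first shape below, while the second is what I am after (a sketch only, with the actual values elided):

produced = {"country": {"name": "...", "link": "...",
                        "airports": [{"name": "...", "lat": "...", "lon": "..."}]}}

desired = {"country": {"name": "...", "link": "...",
                       "airports": [{"name": "...", "lat": "...", "lon": "...",
                                     "schedule": {"...": "..."}}]}}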

My code is on GitHub: https://github.com/IDEES-Rouen/Flight-Scrapping

Answer

The issue is that you fork your item: according to your logic you only want one item per country, so you cannot yield multiple items at any point after parsing the country. What you want to do is stack all of them into one item.
To do that you need to create a parsing loop:

import json

from scrapy import Request

def parse_airports(self, response):
    item = response.meta['my_country_item']
    item['airports'] = []

    for airport in response.xpath('//a[@data-iata]'):
        url = airport.xpath('./@href').extract()
        iata = airport.xpath('./@data-iata').extract()
        iatabis = airport.xpath('./small/text()').extract()
        name = ''.join(airport.xpath('./text()').extract()).strip()
        lat = airport.xpath("./@data-lat").extract()
        lon = airport.xpath("./@data-lon").extract()

        iAirport = dict()
        iAirport['name'] = 'foobar'  # placeholder -- use self.clean_html(name) as before
        iAirport['link'] = url[0]
        iAirport['lat'] = lat[0]
        iAirport['lon'] = lon[0]
        iAirport['code_little'] = iata[0]
        iAirport['code_total'] = iatabis[0]
        item['airports'].append(iAirport)

    urls = []
    for airport in item['airports']:
        json_url = 'https://api.flightradar24.com/common/v1/airport.json?code={code}&plugin\[\]=&plugin-setting\[schedule\]\[mode\]=&plugin-setting\[schedule\]\[timestamp\]={timestamp}&page=1&limit=50&token='.format(
            code=airport['code_little'], timestamp="1484150483")
        urls.append(json_url)
    if not urls:
        return item

    # start with the first url; pop from the front so that the index i
    # stays aligned with the order of item['airports']
    next_url = urls.pop(0)
    return Request(next_url, self.parse_schedule,
                   meta={'airport_item': item, 'airport_urls': urls, 'i': 0})

def parse_schedule(self, response):
    """we want to loop this continuously for every schedule item"""
    item = response.meta['airport_item']
    i = response.meta['i']
    urls = response.meta['airport_urls']

    jsonload = json.loads(response.body_as_unicode())
    # 'foobar' is a placeholder -- plug in the jmespath extraction from your
    # original parse_schedule here
    item['airports'][i]['schedule'] = 'foobar'
    # now do next schedule items
    if not urls:
        yield item
        return
    url = urls.pop(0)  # again pop from the front to preserve order
    yield Request(url, self.parse_schedule,
                  meta={'airport_item': item, 'airport_urls': urls, 'i': i + 1})
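
Note the pattern: instead of yielding one Request per schedule URL up front (which forks the item), the schedule requests are chained one after another, each carrying the growing CountryItem through meta, and the item is only yielded once the URL list is exhausted. The trade-off is that a country's schedules are fetched sequentially rather than concurrently.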
