Multiple nested requests with Scrapy
Question
I am trying to scrape some airplane schedule information from the www.flightradar24.com website for a research project.
The hierarchy of the JSON file I want to obtain is something like this:
Object ID
- country
  - link
  - name
  - airports
    - airport0
      - code_total
      - link
      - lat
      - lon
      - name
      - schedule
        - ...
        - ...
    - airport1
      - code_total
      - link
      - lat
      - lon
      - name
      - schedule
        - ...
        - ...
Country and Airport are stored using items, and as you can see in the JSON file, the CountryItem (link, name attributes) finally stores multiple AirportItems (code_total, link, lat, lon, name, schedule):
import scrapy

class CountryItem(scrapy.Item):
    name = scrapy.Field()
    link = scrapy.Field()
    airports = scrapy.Field()
    other_url = scrapy.Field()
    last_updated = scrapy.Field(serializer=str)

class AirportItem(scrapy.Item):
    name = scrapy.Field()
    code_little = scrapy.Field()
    code_total = scrapy.Field()
    lat = scrapy.Field()
    lon = scrapy.Field()
    link = scrapy.Field()
    schedule = scrapy.Field()
Here is my scraping code for AirportsSpider:
import json

import jmespath
import scrapy
from bs4 import BeautifulSoup
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
from scrapy.spiders import Rule

class AirportsSpider(scrapy.Spider):
    name = "airports"
    start_urls = ['https://www.flightradar24.com/data/airports']
    allowed_domains = ['flightradar24.com']

    def clean_html(self, html_text):
        soup = BeautifulSoup(html_text, 'html.parser')
        return soup.get_text()

    rules = [
        # Extract links matching 'data/airports/' and parse them with parse()
        Rule(LxmlLinkExtractor(allow=('data/airports/',)), callback='parse')
    ]

    def parse(self, response):
        count_country = 0
        countries = []
        for country in response.xpath('//a[@data-country]'):
            if count_country > 5:
                break
            item = CountryItem()
            url = country.xpath('./@href').extract()
            name = country.xpath('./@title').extract()
            item['link'] = url[0]
            item['name'] = name[0]
            count_country += 1
            countries.append(item)
            yield scrapy.Request(url[0], meta={'my_country_item': item},
                                 callback=self.parse_airports)

    def parse_airports(self, response):
        item = response.meta['my_country_item']
        airports = []
        for airport in response.xpath('//a[@data-iata]'):
            url = airport.xpath('./@href').extract()
            iata = airport.xpath('./@data-iata').extract()
            iatabis = airport.xpath('./small/text()').extract()
            name = ''.join(airport.xpath('./text()').extract()).strip()
            lat = airport.xpath("./@data-lat").extract()
            lon = airport.xpath("./@data-lon").extract()
            iAirport = AirportItem()
            iAirport['name'] = self.clean_html(name)
            iAirport['link'] = url[0]
            iAirport['lat'] = lat[0]
            iAirport['lon'] = lon[0]
            iAirport['code_little'] = iata[0]
            iAirport['code_total'] = iatabis[0]
            airports.append(iAirport)
        for airport in airports:
            json_url = 'https://api.flightradar24.com/common/v1/airport.json?code={code}&plugin\[\]=&plugin-setting\[schedule\]\[mode\]=&plugin-setting\[schedule\]\[timestamp\]={timestamp}&page=1&limit=50&token='.format(code=airport['code_little'], timestamp="1484150483")
            yield scrapy.Request(json_url, meta={'airport_item': airport},
                                 callback=self.parse_schedule)
        item['airports'] = airports
        yield {"country": item}

    def parse_schedule(self, response):
        item = response.request.meta['airport_item']
        jsonload = json.loads(response.body_as_unicode())
        json_expression = jmespath.compile("result.response.airport.pluginData.schedule")
        item['schedule'] = json_expression.search(jsonload)
Explanations:
- In my first parse, I call a request for each country link I find, with the CountryItem created and passed via meta={'my_country_item': item}. Each of these requests calls back self.parse_airports.
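Since this item hand-off is the crux of the question, here is a tiny Scrapy-free sketch showing that meta is just a dict handed back to the callback, so the callback receives the very same item object. FakeRequest and FakeResponse are illustrative stand-ins, not real Scrapy classes:

```python
# FakeRequest/FakeResponse model only the meta hand-off; they are
# stand-ins for scrapy.Request and the response Scrapy builds from it.
class FakeRequest:
    def __init__(self, url, meta, callback):
        self.url, self.meta, self.callback = url, meta, callback

class FakeResponse:
    def __init__(self, request):
        # Scrapy exposes the request's meta on the response for the callback
        self.meta = request.meta

def parse_airports(response):
    return response.meta['my_country_item']

item = {'name': 'France', 'link': '/data/airports/france'}
req = FakeRequest('/data/airports/france',
                  {'my_country_item': item}, parse_airports)
received = parse_airports(FakeResponse(req))
print(received is item)  # → True: same object, not a copy
```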
- In my second level of parsing, parse_airports, I retrieve the CountryItem with item = response.meta['my_country_item'], and I create a new item iAirport = AirportItem() for each airport found on this country page. Now I want to get the schedule information for each AirportItem created and stored in the airports list.
- Still in parse_airports, I run a for loop over airports to fetch the schedule information with a new Request. Because I want to include this schedule information in my AirportItem, I attach the item to the request's meta with meta={'airport_item': airport}. The callback of this request runs parse_schedule.
- In the third level of parsing, parse_schedule, I inject the schedule information collected by Scrapy into the AirportItem previously created, using response.request.meta['airport_item'].
But I have a problem in my source code: Scrapy correctly scrapes all the information (country, airports, schedule), but my understanding of nested items seems incorrect. As you can see, the JSON I produce contains country > list of (airport), but not country > list of (airport > schedule).
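Concretely, with placeholder values (illustrative only, not real scraped output), the produced and the wanted shapes differ like this:

```python
# Illustrative placeholder data: the country item is exported before the
# schedule requests complete, so "schedule" never appears in the output.
current = {
    "country": {
        "name": "France",
        "link": "https://www.flightradar24.com/data/airports/france",
        "airports": [
            {"name": "Paris Charles de Gaulle", "code_total": "CDG/LFPG"}
            # no "schedule" key here
        ],
    }
}

wanted = {
    "country": {
        "name": "France",
        "link": "https://www.flightradar24.com/data/airports/france",
        "airports": [
            {"name": "Paris Charles de Gaulle", "code_total": "CDG/LFPG",
             "schedule": {"arrivals": [], "departures": []}}
        ],
    }
}
```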
My code is on GitHub: https://github.com/IDEES-Rouen/Flight-Scrapping
Answer
The issue is that you fork your item: according to your logic you only want one item per country, so you cannot yield multiple items at any point after parsing the country. What you want to do is stack all of them into one item.
To do that you need to create a parsing loop:
import json

from scrapy import Request

def parse_airports(self, response):
    item = response.meta['my_country_item']
    item['airports'] = []

    for airport in response.xpath('//a[@data-iata]'):
        url = airport.xpath('./@href').extract()
        iata = airport.xpath('./@data-iata').extract()
        iatabis = airport.xpath('./small/text()').extract()
        name = ''.join(airport.xpath('./text()').extract()).strip()
        lat = airport.xpath("./@data-lat").extract()
        lon = airport.xpath("./@data-lon").extract()
        iAirport = dict()
        iAirport['name'] = 'foobar'  # placeholder; keep your real parsing here
        iAirport['link'] = url[0]
        iAirport['lat'] = lat[0]
        iAirport['lon'] = lon[0]
        iAirport['code_little'] = iata[0]
        iAirport['code_total'] = iatabis[0]
        item['airports'].append(iAirport)

    urls = []
    for airport in item['airports']:
        json_url = 'https://api.flightradar24.com/common/v1/airport.json?code={code}&plugin\[\]=&plugin-setting\[schedule\]\[mode\]=&plugin-setting\[schedule\]\[timestamp\]={timestamp}&page=1&limit=50&token='.format(
            code=airport['code_little'], timestamp="1484150483")
        urls.append(json_url)
    if not urls:
        return item

    # start with the first url; pop from the front so the running index i
    # stays aligned with the order of item['airports']
    next_url = urls.pop(0)
    return Request(next_url, self.parse_schedule,
                   meta={'airport_item': item, 'airport_urls': urls, 'i': 0})

def parse_schedule(self, response):
    """we want to loop this continuously for every schedule item"""
    item = response.meta['airport_item']
    i = response.meta['i']
    urls = response.meta['airport_urls']
    jsonload = json.loads(response.body_as_unicode())
    item['airports'][i]['schedule'] = 'foobar'  # placeholder; extract the real schedule here
    # now do the next schedule item
    if not urls:
        yield item
        return
    url = urls.pop(0)
    yield Request(url, self.parse_schedule,
                  meta={'airport_item': item, 'airport_urls': urls, 'i': i + 1})
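The control flow of this answer can be sketched without Scrapy: one schedule request at a time, with the remaining URLs and a running index carried along, and the country item emitted only once at the end. In the sketch below, fetch_schedule and the URLs are stand-ins for the real HTTP request and API call:

```python
# Scrapy-free sketch of sequential request chaining; fetch_schedule stands
# in for the real request + JSON parsing done in parse_schedule.
def collect_schedules(item, urls, fetch_schedule):
    pending = list(urls)
    i = 0
    while pending:                        # each loop turn = one "response"
        url = pending.pop(0)              # take URLs in order so i lines up
        item['airports'][i]['schedule'] = fetch_schedule(url)
        i += 1
    return item                           # emitted once, fully populated

country = {'name': 'France',
           'airports': [{'code_little': 'CDG'}, {'code_little': 'ORY'}]}
urls = ['https://example.invalid/CDG.json', 'https://example.invalid/ORY.json']
result = collect_schedules(country, urls, lambda url: {'from': url})
print(result['airports'][1]['schedule'])
# → {'from': 'https://example.invalid/ORY.json'}
```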