Multiple nested requests with Scrapy
Question
For a research project, I am trying to scrape some airplane schedule information from the www.flightradar24.com website.
The hierarchy of the JSON file I want to obtain looks like this:
Object ID
- country
  - link
  - name
  - airports
    - airport0
      - code_total
      - link
      - lat
      - lon
      - name
      - schedule
        - ...
        - ...
    - airport1
      - code_total
      - link
      - lat
      - lon
      - name
      - schedule
        - ...
        - ...
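To make the target shape concrete, a single output entry could look like the following sketch; every value here is a placeholder I made up for illustration, not real data from the site:

```python
# Hypothetical example of one output entry; all values are placeholders.
expected_entry = {
    "country": {
        "name": "France",
        "link": "https://www.flightradar24.com/data/airports/france",
        "airports": [
            {
                "code_total": "CDG/LFPG",
                "link": "https://www.flightradar24.com/data/airports/cdg",
                "lat": "49.012516",
                "lon": "2.555752",
                "name": "Paris Charles de Gaulle Airport",
                "schedule": {"arrivals": [], "departures": []},
            },
        ],
    }
}

# The schedule must sit inside each airport, which sits inside the country.
assert "schedule" in expected_entry["country"]["airports"][0]
```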
Country and Airport are stored using Items, and as you can see in the JSON file, the CountryItem (link, name attributes) ultimately stores multiple AirportItems (code_total, link, lat, lon, name, schedule):
class CountryItem(scrapy.Item):
    name = scrapy.Field()
    link = scrapy.Field()
    airports = scrapy.Field()
    other_url = scrapy.Field()
    last_updated = scrapy.Field(serializer=str)

class AirportItem(scrapy.Item):
    name = scrapy.Field()
    code_little = scrapy.Field()
    code_total = scrapy.Field()
    lat = scrapy.Field()
    lon = scrapy.Field()
    link = scrapy.Field()
    schedule = scrapy.Field()
Here is my spider code, AirportsSpider, to do that:
class AirportsSpider(scrapy.Spider):
    name = "airports"
    start_urls = ['https://www.flightradar24.com/data/airports']
    allowed_domains = ['flightradar24.com']

    def clean_html(self, html_text):
        soup = BeautifulSoup(html_text, 'html.parser')
        return soup.get_text()

    rules = [
        # Extract links matching 'item.php' and parse them with the spider's method parse_item
        Rule(LxmlLinkExtractor(allow=('data/airports/',)), callback='parse')
    ]

    def parse(self, response):
        count_country = 0
        countries = []
        for country in response.xpath('//a[@data-country]'):
            if count_country > 5:
                break
            item = CountryItem()
            url = country.xpath('./@href').extract()
            name = country.xpath('./@title').extract()
            item['link'] = url[0]
            item['name'] = name[0]
            count_country += 1
            countries.append(item)
            yield scrapy.Request(url[0], meta={'my_country_item': item}, callback=self.parse_airports)

    def parse_airports(self, response):
        item = response.meta['my_country_item']
        airports = []
        for airport in response.xpath('//a[@data-iata]'):
            url = airport.xpath('./@href').extract()
            iata = airport.xpath('./@data-iata').extract()
            iatabis = airport.xpath('./small/text()').extract()
            name = ''.join(airport.xpath('./text()').extract()).strip()
            lat = airport.xpath("./@data-lat").extract()
            lon = airport.xpath("./@data-lon").extract()

            iAirport = AirportItem()
            iAirport['name'] = self.clean_html(name)
            iAirport['link'] = url[0]
            iAirport['lat'] = lat[0]
            iAirport['lon'] = lon[0]
            iAirport['code_little'] = iata[0]
            iAirport['code_total'] = iatabis[0]
            airports.append(iAirport)

        for airport in airports:
            json_url = 'https://api.flightradar24.com/common/v1/airport.json?code={code}&plugin[]=&plugin-setting[schedule][mode]=&plugin-setting[schedule][timestamp]={timestamp}&page=1&limit=50&token='.format(code=airport['code_little'], timestamp="1484150483")
            yield scrapy.Request(json_url, meta={'airport_item': airport}, callback=self.parse_schedule)

        item['airports'] = airports
        yield {"country": item}

    def parse_schedule(self, response):
        item = response.request.meta['airport_item']
        jsonload = json.loads(response.body_as_unicode())
        json_expression = jmespath.compile("result.response.airport.pluginData.schedule")
        item['schedule'] = json_expression.search(jsonload)
Explanation:

In my first parse, I call a Request for each country link I find, with the CountryItem I created passed via meta={'my_country_item': item}. Each of these requests calls back self.parse_airports.
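One detail worth noting: Request.meta carries a reference to the item, not a copy, so whatever the callback mutates is the very same object the first parse created. A minimal plain-Python sketch of that semantics (ordinary dicts standing in for the Scrapy item and meta, with placeholder values):

```python
# Plain-Python sketch: meta passes the same object, not a copy.
item = {"name": "France", "link": "/data/airports/france"}  # placeholder values
meta = {"my_country_item": item}  # what scrapy.Request(meta=...) would carry

# Inside the callback, response.meta hands back the same dict:
received = meta["my_country_item"]
received["airports"] = []

# The original item sees the mutation, because both names point to one object.
assert received is item
assert item["airports"] == []
```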
In my second level of parsing, parse_airports, I retrieve the CountryItem with item = response.meta['my_country_item'], and I create a new item, iAirport = AirportItem(), for each airport I find on this country page. Now I want to get the schedule information for each AirportItem created and stored in the airports list.
In the same parse_airports level, I run a for loop over airports to fetch the schedule information with a new Request. Because I want to include this schedule information in my AirportItem, I pass the item in the meta information, meta={'airport_item': airport}. The callback of this request runs parse_schedule.
In the third level of parsing, parse_schedule, I inject the schedule information collected by Scrapy into the AirportItem previously created, using response.request.meta['airport_item'].
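For a simple dotted path like the one used here, the jmespath lookup can be sketched in plain Python, which is handy for checking the expression against a sample payload without installing jmespath (the payload below is a stripped-down, made-up stand-in for the API response):

```python
from functools import reduce

def dotted_search(path, data):
    """Resolve a simple dotted path into nested dicts, returning None on a miss
    (mimicking what jmespath does for plain key paths)."""
    return reduce(lambda d, k: d.get(k) if isinstance(d, dict) else None,
                  path.split("."), data)

# Stripped-down stand-in for the airport.json response (placeholder payload).
payload = {"result": {"response": {"airport": {"pluginData": {"schedule": {"arrivals": []}}}}}}

assert dotted_search("result.response.airport.pluginData.schedule", payload) == {"arrivals": []}
assert dotted_search("result.response.missing.key", payload) is None
```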
But I have a problem in my source code: Scrapy correctly scrapes all the information (country, airports, schedule), but my understanding of nested items seems to be wrong. As you can see, the JSON I produce contains country > list of (airport), but not country > list of (airport > schedule).
My code is on GitHub: https://github.com/IDEES-Rouen/Flight-Scrapping
Answer
The issue is that you fork your item: according to your logic you only want one item per country, so you can't yield multiple items at any point after parsing the country. What you want to do is stack all of them into one item.
To do that, you need to create a parsing loop:
def parse_airports(self, response):
    item = response.meta['my_country_item']
    item['airports'] = []

    for airport in response.xpath('//a[@data-iata]'):
        url = airport.xpath('./@href').extract()
        iata = airport.xpath('./@data-iata').extract()
        iatabis = airport.xpath('./small/text()').extract()
        name = ''.join(airport.xpath('./text()').extract()).strip()
        lat = airport.xpath("./@data-lat").extract()
        lon = airport.xpath("./@data-lon").extract()

        iAirport = dict()
        iAirport['name'] = 'foobar'
        iAirport['link'] = url[0]
        iAirport['lat'] = lat[0]
        iAirport['lon'] = lon[0]
        iAirport['code_little'] = iata[0]
        iAirport['code_total'] = iatabis[0]
        item['airports'].append(iAirport)

    urls = []
    for airport in item['airports']:
        json_url = 'https://api.flightradar24.com/common/v1/airport.json?code={code}&plugin[]=&plugin-setting[schedule][mode]=&plugin-setting[schedule][timestamp]={timestamp}&page=1&limit=50&token='.format(
            code=airport['code_little'], timestamp="1484150483")
        urls.append(json_url)
    if not urls:
        return item

    # start with first url
    next_url = urls.pop()
    return Request(next_url, self.parse_schedule,
                   meta={'airport_item': item, 'airport_urls': urls, 'i': 0})

def parse_schedule(self, response):
    """we want to loop this continuously for every schedule item"""
    item = response.meta['airport_item']
    i = response.meta['i']
    urls = response.meta['airport_urls']
    jsonload = json.loads(response.body_as_unicode())
    item['airports'][i]['schedule'] = 'foobar'
    # now do next schedule items
    if not urls:
        yield item
        return
    url = urls.pop()
    yield Request(url, self.parse_schedule,
                  meta={'airport_item': item, 'airport_urls': urls, 'i': i + 1})
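The control flow of this answer can be sketched without Scrapy: a pending-URL list is consumed one request at a time, threading the single shared item through every step, and the item is only emitted once the list is empty. In the sketch below, fetch is a hypothetical stand-in for downloading and parsing the schedule JSON, and the URL consumption order is simplified to front-to-back:

```python
def fetch(url):
    # Hypothetical stand-in for downloading and parsing the schedule JSON.
    return {"fetched_from": url}

item = {"airports": [{"name": "CDG"}, {"name": "ORY"}]}   # placeholder airports
urls = ["api/airport.json?code=CDG", "api/airport.json?code=ORY"]

i = 0
while urls:                        # plays the role of the parse_schedule chain
    url = urls.pop(0)              # take the next pending request
    item["airports"][i]["schedule"] = fetch(url)  # fill the matching airport
    i += 1                         # mirrors meta={'i': i + 1}

# Only now, with every schedule attached, is the single item emitted,
# which is why no airport ends up without its schedule.
assert item["airports"][1]["schedule"] == {"fetched_from": "api/airport.json?code=ORY"}
```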