Creating Scrapy array of items with multiple parse
Question
I am scraping listings with Scrapy. My script first parses the listing URLs using parse_node, then parses each listing using parse_listing, and for each listing it parses the listing's agents using parse_agent. I would like to create an array that builds up as Scrapy parses through a listing and its agents, and that resets for each new listing.
Here's my parse script:
def parse_node(self,response,node):
    yield Request('LISTING LINK',callback=self.parse_listing)

def parse_listing(self,response):
    yield response.xpath('//node[@id="ListingId"]/text()').extract_first()
    yield response.xpath('//node[@id="ListingTitle"]/text()').extract_first()
    for agent in string.split(response.xpath('//node[@id="Agents"]/text()').extract_first() or "",'^'):
        yield Request('AGENT LINK',callback=self.parse_agent)

def parse_agent(self,response):
    yield response.xpath('//node[@id="AgentName"]/text()').extract_first()
    yield response.xpath('//node[@id="AgentEmail"]/text()').extract_first()
I would like parse_listing to result in:
{
    'id':123,
    'title':'Amazing Listing'
}
Then parse_agent should add to the listing array:
{
    'id':123,
    'title':'Amazing Listing',
    'agent':[
        {
            'name':'jon doe',
            'email':'jon.doe@email.com'
        },
        {
            'name':'jane doe',
            'email':'jane.doe@email.com'
        }
    ]
}
How do I get the results from each level and build up an array?
Recommended answer
This is a somewhat complicated issue: you need to form a single item from multiple different URLs.
Scrapy allows you to carry data over in a request's meta attribute, so you can do something like:
from collections import defaultdict

from scrapy import Request


def parse_node(self, response, node):
    yield Request('LISTING LINK', callback=self.parse_listing)

def parse_listing(self, response):
    item = defaultdict(list)
    item['id'] = response.xpath('//node[@id="ListingId"]/text()').extract_first()
    item['title'] = response.xpath('//node[@id="ListingTitle"]/text()').extract_first()
    # find all agent urls; skip empty strings so a listing without agents gives an empty list
    agents = response.xpath('//node[@id="Agents"]/text()').extract_first() or ""
    agent_urls = [url for url in agents.split('^') if url]
    if not agent_urls:  # no agents to crawl: the item is already complete
        yield item
        return
    # we want to go through agent urls one-by-one and update single item with agent data
    url = agent_urls.pop(0)
    yield Request(url, callback=self.parse_agent,
                  meta={'item': item, 'agent_urls': agent_urls})

def parse_agent(self, response):
    item = response.meta['item']  # retrieve item generated in previous request
    agent = dict()
    agent['name'] = response.xpath('//node[@id="AgentName"]/text()').extract_first()
    agent['email'] = response.xpath('//node[@id="AgentEmail"]/text()').extract_first()
    item['agents'].append(agent)
    # check if we have any more agent urls left
    agent_urls = response.meta['agent_urls']
    if not agent_urls:  # we crawled all of the agents!
        yield item  # yield rather than return: this callback is a generator
        return
    # if we do - crawl next agent and carry over our current item
    url = agent_urls.pop(0)
    yield Request(url, callback=self.parse_agent,
                  meta={'item': item, 'agent_urls': agent_urls})
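To see the chaining idea in isolation, here is a minimal sketch that runs without Scrapy. Everything in it is an assumption made for illustration: FakeRequest is a simplified stand-in for scrapy.Request, FAKE_PAGES simulates already-extracted page fields, and run_chain is a toy driver, not how Scrapy's engine actually schedules requests. It only shows how one item travels through meta from callback to callback and is emitted once the last agent has been added.

```python
from collections import defaultdict, deque

# Hypothetical stand-in for downloaded pages: url -> already-extracted fields.
FAKE_PAGES = {
    "listing/123": {"id": 123, "title": "Amazing Listing",
                    "agents": "agent/1^agent/2"},
    "agent/1": {"name": "jon doe", "email": "jon.doe@email.com"},
    "agent/2": {"name": "jane doe", "email": "jane.doe@email.com"},
}


class FakeRequest:
    """Hypothetical, simplified stand-in for scrapy.Request."""

    def __init__(self, url, callback, meta=None):
        self.url = url
        self.callback = callback
        self.meta = meta or {}


def parse_listing(page, meta):
    # A fresh item per listing, so nothing leaks from one listing to the next.
    item = defaultdict(list)
    item["id"] = page["id"]
    item["title"] = page["title"]
    agent_urls = [url for url in page["agents"].split("^") if url]
    if not agent_urls:  # no agents: the item is already complete
        yield dict(item)
        return
    # Start the chain with the first agent; the rest travel along in meta.
    yield FakeRequest(agent_urls.pop(0), parse_agent,
                      meta={"item": item, "agent_urls": agent_urls})


def parse_agent(page, meta):
    item = meta["item"]  # the item built by the previous callbacks
    item["agents"].append({"name": page["name"], "email": page["email"]})
    agent_urls = meta["agent_urls"]
    if not agent_urls:  # last agent done: emit the finished item
        yield dict(item)
        return
    yield FakeRequest(agent_urls.pop(0), parse_agent,
                      meta={"item": item, "agent_urls": agent_urls})


def run_chain(start_url):
    """Toy driver: follow requests until the callbacks yield finished items."""
    queue = deque([FakeRequest(start_url, parse_listing)])
    items = []
    while queue:
        request = queue.popleft()
        for result in request.callback(FAKE_PAGES[request.url], request.meta):
            if isinstance(result, FakeRequest):
                queue.append(result)
            else:
                items.append(result)
    return items


items = run_chain("listing/123")
print(items)
```

Because the agent requests run one after another rather than in parallel, only one callback touches the item at a time, and the item is emitted exactly once. The trade-off is that the chain is sequential: in a real spider, if one agent request fails, the item carried in its meta is never emitted unless the failure is handled separately.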