How to collect data from multiple pages into single data structure with scrapy


Problem description

I am trying to scrape data from a site. The data is structured as multiple objects, each with a set of data. For example, people with names, ages, and occupations.

My problem is that this data is split across two levels in the website.
The first page is, say, a list of names and ages with a link to each person's profile page.
Their profile page lists their occupation.

I already have a spider written with Scrapy in Python which can collect the data from the top layer and crawl through multiple paginations.
But how can I collect the data from the inner pages while keeping it linked to the appropriate object?

Currently, I have the output structured with JSON as:

   [{"name": "name", "age": "age", "occupation": "occupation"},
    {"name": "name", "age": "age", "occupation": "occupation"}] etc

Can the parse function reach across pages like that?

Solution

Here is one way to deal with this: yield/return the item only once, when it has all of its attributes.

from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector

# in the top-level parse method:
yield Request(page1,
              callback=self.page1_data)

def page1_data(self, response):
    hxs = HtmlXPathSelector(response)
    i = TestItem()
    i['name'] = 'name'
    i['age'] = 'age'
    url_profile_page = 'url to the profile page'

    # pass the partially-filled item along with the request
    yield Request(url_profile_page,
                  meta={'item': i},
                  callback=self.profile_page)

def profile_page(self, response):
    hxs = HtmlXPathSelector(response)
    old_item = response.request.meta['item']
    # parse the remaining fields (e.g. occupation)
    # and assign them to old_item

    yield old_item
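The key idea is that the partially-built item travels with the request via meta and is completed in the later callback. To make the control flow concrete without running a crawl, here is a stdlib-only sketch that mimics Scrapy's request/callback chaining with canned page data. The page contents, callback names, and the tiny scheduler loop are all hypothetical stand-ins for illustration, not Scrapy APIs:

```python
# Canned "pages": a listing page and one profile page per person.
# These stand in for real HTTP responses.
PAGES = {
    "/people": [("Alice", 30, "/profile/alice"), ("Bob", 25, "/profile/bob")],
    "/profile/alice": "Engineer",
    "/profile/bob": "Baker",
}

def parse_listing(url, meta):
    """Top-level callback: build partial items, chain a request per profile."""
    for name, age, profile_url in PAGES[url]:
        item = {"name": name, "age": age}  # partial item, no occupation yet
        # analogous to: yield Request(profile_url, meta={'item': item},
        #                             callback=self.parse_profile)
        yield ("request", profile_url, {"item": item}, parse_profile)

def parse_profile(url, meta):
    """Detail-page callback: complete the item carried in meta, then emit it."""
    item = meta["item"]
    item["occupation"] = PAGES[url]
    yield ("item", item)

def crawl(start_url, start_callback):
    """A tiny scheduler loop standing in for Scrapy's engine."""
    queue = [(start_url, {}, start_callback)]
    items = []
    while queue:
        url, meta, callback = queue.pop(0)
        for result in callback(url, meta):
            if result[0] == "request":
                _, next_url, next_meta, next_cb = result
                queue.append((next_url, next_meta, next_cb))
            else:
                items.append(result[1])
    return items

print(crawl("/people", parse_listing))
# → [{'name': 'Alice', 'age': 30, 'occupation': 'Engineer'},
#    {'name': 'Bob', 'age': 25, 'occupation': 'Baker'}]
```

In a real spider, `meta={'item': item}` on the `Request` plays the role of the meta dict here; newer Scrapy versions (1.7+) also offer `cb_kwargs` for the same purpose.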

