Scrapy-创建嵌套的JSON对象 [英] Scrapy - Creating nested JSON Object

查看:108
本文介绍了Scrapy-创建嵌套的JSON对象的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在学习如何与Scrapy一起工作,同时刷新有关Python的知识?/学校编码.

I'm learning how to work with Scrapy while refreshing my knowledge in Python?/Coding from school.

目前,我正在玩 imdb排名前250名列表,但在JSON方面苦苦挣扎输出文件.

Currently, I'm playing around with imdb top 250 list but struggling with a JSON output file.

我当前的代码是:

 # -*- coding: utf-8 -*-
import scrapy

from top250imdb.items import Top250ImdbItem


class ActorsSpider(scrapy.Spider):
    name = "actors"
    allowed_domains = ["imdb.com"]
    start_urls = ['http://www.imdb.com/chart/top']

    # Parsing each movie and preparing the url for the actors list
    def parse(self, response):
        for film in response.css('.titleColumn'):
            url = film.css('a::attr(href)').extract_first()
            actors_url = 'http://imdb.com' + url[:17] + 'fullcredits?ref_=tt_cl_sm#cast'
            yield scrapy.Request(actors_url, self.parse_actor)

    # Finding all actors and storing them on item
    # Refer to items.py
    def parse_actor(self, response):
        final_list = []
        item = Top250ImdbItem()
        item['poster'] = response.css('#main img::attr(src)').extract_first()
        item['title'] = response.css('h3[itemprop~=name] a::text').extract()
        item['photo'] = response.css('#fullcredits_content .loadlate::attr(loadlate)').extract()
        item['actors'] = response.css('td[itemprop~=actor] span::text').extract()

        final_list.append(item)

        updated_list = []

        for item in final_list:
            for i in range(len(item['title'])):
                sub_item = {}
                sub_item['movie'] = {}
                sub_item['movie']['poster'] = [item['poster']]
                sub_item['movie']['title'] = [item['title'][i]]
                sub_item['movie']['photo'] = [item['photo']]
                sub_item['movie']['actors'] = [item['actors']]
                updated_list.append(sub_item)
            return updated_list

并且我的输出文件为我提供了这种JSON组合:

and my output file is giving me this JSON composition:

[
  {
    "movie": {
      "poster": ["https://images-na.ssl-images-amazon.com/poster..."], 
      "title": ["The Shawshank Redemption"], 
      "photo": [["https://images-na.ssl-images-amazon.com/photo..."]], 
      "actors": [["Tim Robbins","Morgan Freeman",...]]}
    },{
    "movie": {
      "poster": ["https://images-na.ssl-images-amazon.com/poster..."], 
      "title": ["The Godfather"], 
      "photo": [["https://images-na.ssl-images-amazon.com/photo..."]], 
      "actors": [["Alexandre Rodrigues", "Leandro Firmino", "Phellipe Haagensen",...]]}
  }
]

但是我正在寻求实现:

{
  "movies": [{
    "poster": "https://images-na.ssl-images-amazon.com/poster...",
    "title": "The Shawshank Redemption",
    "actors": [
      {"photo": "https://images-na.ssl-images-amazon.com/photo...",
      "name": "Tim Robbins"},
      {"photo": "https://images-na.ssl-images-amazon.com/photo...",
      "name": "Morgan Freeman"},...
    ]
  },{
    "poster": "https://images-na.ssl-images-amazon.com/poster...",
    "title": "The Godfather",
    "actors": [
      {"photo": "https://images-na.ssl-images-amazon.com/photo...",
      "name": "Marlon Brando"},
      {"photo": "https://images-na.ssl-images-amazon.com/photo...",
      "name": "Al Pacino"},...
    ]
  }]
}

在我的items.py文件中,我具有以下内容:

in my items.py file I have the following:

import scrapy


class Top250ImdbItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()

    # Items from actors.py
    poster = scrapy.Field()
    title = scrapy.Field()
    photo = scrapy.Field()
    actors = scrapy.Field()
    movie = scrapy.Field()
    pass

我知道以下情况:

  1. 我的结果没有按顺序显示,网页列表中的第一部电影始终是我的输出文件中的第一部电影,但其余部分则不是.我还在努力.

  1. My results are not coming out in order, the 1st movie on web page list is always the first movie on my output file but the rest is not. I'm still working on that.

我可以做同样的事情,但是使用Top250ImdbItem(),仍然可以更详细地浏览它的完成方式.

I can do the same thing but working with Top250ImdbItem(), still browsing around how that is done in a more detailed way.

对于我的JSON而言,这可能不是完美的布局,建议还是可以接受,即使我知道没有完美的方法或唯一的方法",也请告诉我.

This might not be the perfect layout for my JSON, suggestions are welcomed or if it is, let me know, even though I know there is no perfect way or "the only way".

有些演员没有照片,实际上加载了一个不同的CSS选择器.现在,我想避免获取无图片缩略图",因此可以将这些项目留空.

Some actors don't have a photo and it actually loads a different CSS selector. For now, I would like to avoid reaching for the "no picture thumbnail" so it's ok to leave those items empty.

示例:

{"photo": "", "name": "Al Pacino"}

推荐答案

问题:...正在处理JSON输出文件

Question: ... struggling with a JSON output file


注意:无法使用您的 ActorSpider ,得到错误:不支持伪元素.

Note: Can't use your ActorsSpider, get Error: Pseudo-elements are not supported.

# Define a `dict` **once**
top250ImdbItem = {'movies': []}

def parse_actor(self, response):
    poster = response.css(...
    title = response.css(...
    photos = response.css(...
    actors = response.css(...

    # Assuming List of Actors are in sync with List of Photos
    actors_list = []
    for i, actor in enumerate(actors):
        actors_list.append({"name": actor, "photo": photos[i]})

    one_movie = {"poster": poster,
                 "title": title,
                 "actors": actors_list
                }

    # Append One Movie to Top250 'movies' List
    top250ImdbItem['movies'].append(one_movie)

这篇关于Scrapy-创建嵌套的JSON对象的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆