Scrapy's JSON output forms an array of JSON objects


Question

I'm trying to scrape a games info website using Scrapy. The scraping process goes like this: scrape the categories -> scrape the list of games (multiple pages per category) -> scrape the game info. The scraped info is supposed to go into a JSON file. I'm getting the following result:

[
    {"category": "cat1", "games": [...]},
    {"category": "cat2", "games": [...]},
    ...
]

But I would like to get this result instead:

{ "categories":
    [
        {"category": "cat1", "games": [...]},
        {"category": "cat2", "games": [...]},
        ...
    ]
}

I tried the steps from this post and this post, with no success, and couldn't find more related questions.

Any help would be appreciated.

My spider:

import scrapy
from ..items import Category, Game

class GamesSpider(scrapy.Spider):
    name = 'games'
    start_urls = ['https://www.example.com/categories']
    base_url = 'https://www.example.com'

    def parse(self, response):
        categories = response.xpath("...")

        for category in categories:
            cat_name = category.xpath(".//text()").get()
            url = self.base_url + category.xpath(".//@href").get()    
            
            cat = Category()
            cat['category'] = cat_name
            
            yield response.follow(url=url, 
                                  callback=self.parse_category, 
                                  meta={ 'category': cat })

    def parse_category(self, response):
        games_url_list = response.xpath('//.../a/@href').getall()

        cat = response.meta['category']
        url = self.base_url + games_url_list.pop()
        next_page = response.xpath('//a[...]/@href').get()

        if next_page:
            next_page = self.base_url + next_page

        yield response.follow(url=url, 
                              callback=self.parse_game, 
                              meta={'category': cat, 
                                    'games_url_list': games_url_list, 
                                    'next_page': next_page})
            
    def parse_game(self, response):
        cat = response.meta['category']
        game = Game()

        # Create the category's games list on first use
        if 'games_list' not in cat:
            cat['games_list'] = []
        
        game['title_en'] = response.xpath('...').get()
        game['os'] = response.xpath('...').get()
        game['users_rating'] = response.xpath('...').get()
 
        cat['games_list'].append(game)

        games_url_list = response.meta['games_url_list']
        next_page = response.meta['next_page']
        
        if games_url_list: 
            url = self.base_url + games_url_list.pop()
            yield response.follow(url=url, 
                                  callback=self.parse_game, 
                                  meta={'category': cat, 
                                        'games_url_list': games_url_list, 
                                        'next_page': next_page})

        else:
            if next_page:
                yield response.follow(url=next_page, 
                                      callback=self.parse_category, 
                                      meta={'category': cat})
            else:
                yield cat

My items.py file:

import scrapy

class Category(scrapy.Item):
    category = scrapy.Field()
    games_list = scrapy.Field()

class Game(scrapy.Item):
    title_en = scrapy.Field()
    os = scrapy.Field()
    users_rating = scrapy.Field()

Answer

You need to write a custom item exporter, or handle post-processing of the file generated by Scrapy separately, e.g. with a standalone Python script that converts the output to the desired format.
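As a sketch of the post-processing option (the file paths here are placeholders, assuming the spider's feed was written with something like `scrapy crawl games -o output.json`), a standalone script can wrap the exported array:

```python
import json

def wrap_feed(in_path, out_path):
    """Wrap Scrapy's top-level JSON array in a {"categories": [...]} object."""
    # Scrapy's built-in JSON feed exporter writes a plain array of items.
    with open(in_path, encoding="utf-8") as f:
        categories = json.load(f)

    # Re-serialize under the desired top-level key.
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump({"categories": categories}, f, ensure_ascii=False, indent=2)

if __name__ == "__main__":
    wrap_feed("output.json", "output_wrapped.json")
```

Run it once after the crawl finishes; the downside is that the whole feed must fit in memory.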

