Scrapy's JSON output forms an array of JSON objects
Problem description
I'm trying to scrape a games info website using Scrapy. The scraping process goes like this: scrape the categories -> scrape the list of games (multiple pages for each category) -> scrape each game's info. The scraped info is supposed to go into a JSON file. I'm getting the following result:
[
{"category": "cat1", "games": [...]},
{"category": "cat2", "games": [...]},
...
]
But I want to get this result:
{ "categories":
[
{"category": "cat1", "games": [...]},
{"category": "cat2", "games": [...]},
...
]
}
I tried following the steps from this post and this post, with no success, and I couldn't find any other related questions.
Any help would be appreciated.
My spider:
import scrapy
from ..items import Category, Game


class GamesSpider(scrapy.Spider):
    name = 'games'
    start_urls = ['https://www.example.com/categories']
    base_url = 'https://www.example.com'

    def parse(self, response):
        # One request per category; the Category item travels along in meta.
        categories = response.xpath("...")
        for category in categories:
            cat_name = category.xpath(".//text()").get()
            url = self.base_url + category.xpath(".//@href").get()
            cat = Category()
            cat['category'] = cat_name
            yield response.follow(url=url,
                                  callback=self.parse_category,
                                  meta={'category': cat})

    def parse_category(self, response):
        # Collect the game URLs on this listing page and start visiting them one by one.
        games_url_list = response.xpath('//.../a/@href').getall()
        cat = response.meta['category']
        url = self.base_url + games_url_list.pop()
        next_page = response.xpath('//a[...]/@href').get()
        if next_page:
            next_page = self.base_url + next_page
        yield response.follow(url=url,
                              callback=self.parse_game,
                              meta={'category': cat,
                                    'games_url_list': games_url_list,
                                    'next_page': next_page})

    def parse_game(self, response):
        cat = response.meta['category']
        game = Game()
        # Create the list the first time a game is added to this category.
        try:
            cat['games_list']
        except KeyError:
            cat['games_list'] = []
        game['title_en'] = response.xpath('...')
        game['os'] = response.xpath('...')
        game['users_rating'] = response.xpath('...')
        cat['games_list'].append(game)
        games_url_list = response.meta['games_url_list']
        next_page = response.meta['next_page']
        if games_url_list:
            # More games left on the current listing page.
            url = self.base_url + games_url_list.pop()
            yield response.follow(url=url,
                                  callback=self.parse_game,
                                  meta={'category': cat,
                                        'games_url_list': games_url_list,
                                        'next_page': next_page})
        else:
            if next_page:
                # Continue with the next listing page of the same category.
                yield response.follow(url=next_page,
                                      callback=self.parse_category,
                                      meta={'category': cat})
            else:
                # No games and no pages left: the category is complete.
                yield cat
My items.py file:
import scrapy


class Category(scrapy.Item):
    category = scrapy.Field()
    games_list = scrapy.Field()


class Game(scrapy.Item):
    title_en = scrapy.Field()
    os = scrapy.Field()
    users_rating = scrapy.Field()
Recommended answer
You need to either write a custom item exporter or post-process the file that Scrapy generates, e.g. with a standalone Python script that converts Scrapy's output format into the desired one.
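A minimal sketch of the post-processing option, assuming the crawl has already written a top-level JSON array to a file (for example with something like `scrapy crawl games -o games.json` into a fresh file). The function name `wrap_categories`, the file names, and the default `categories` key are illustrative choices, not part of Scrapy:

```python
import json
from pathlib import Path


def wrap_categories(src_path, dst_path, key="categories"):
    """Wrap the JSON array stored in src_path under a single top-level key.

    Hypothetical helper for illustration: reads the array Scrapy wrote,
    writes {key: [...]} to dst_path, and returns the wrapped dict.
    """
    items = json.loads(Path(src_path).read_text(encoding="utf-8"))
    if not isinstance(items, list):
        # Guard against running the script on an already wrapped file.
        raise ValueError("expected a top-level JSON array in " + str(src_path))

    wrapped = {key: items}
    Path(dst_path).write_text(
        json.dumps(wrapped, ensure_ascii=False, indent=2),
        encoding="utf-8",
    )
    return wrapped
```

Calling `wrap_categories("games.json", "games_wrapped.json")` after the crawl finishes should leave the spider and items untouched and produce a file shaped like `{"categories": [...]}`. The exporter route avoids the second step but needs a subclass of one of Scrapy's item exporters registered in the project settings.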