scrapy json output all items on one line


Problem description

I'm trying to get my output to look like the following in JSON format.

{"loser": "De Schepper K." ,"winner": "Herbert P.", "url":
"https://www.sofascore.com/tennis/2018-02-07"}

But I'm currently getting individual lines for each loser item and winner item. I would like both the winner and the loser to be on the same line as the url.

{"loser": "De Schepper K.", "url": 
"https://www.sofascore.com/tennis/2018-02-07"}
{"winner": "Herbert P.", "url": 
"https://www.sofascore.com/tennis/2018-02-07"}
{"loser": "Sugita Y.", "url": 
 "https://www.sofascore.com/tennis/2018-02-07"}

I'm not sure whether it's my selectors that are causing this behaviour, but I'd like to know how I can customise the pipeline so the loser, winner and date all end up on the same JSON line.

I've never exported to JSON format before, so this is new to me. How do you specify which JSON keys and values go on each line using a custom pipeline?

I also tried to do this with the CSV item exporter and got strange behaviour there too; see Scrapy output is showing empty rows per column.

Here's my spider.py

import scrapy
from scrapy_splash import SplashRequest
from scrapejs.items import SofascoreItemLoader
from scrapy import Spider

import json
from scrapy.http import Request, FormRequest

class MySpider(scrapy.Spider):
    name = "jsscraper"

    start_urls = ["https://www.sofascore.com/tennis/2018-02-07"]

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url=url,
                                callback=self.parse,
                                endpoint='render.html',
                                args={'wait': 1.5})

    def parse(self, response):
        for row in response.css('.event-team'):
            il = SofascoreItemLoader(selector=row)
            il.add_css('winner', '.event-team:nth-child(2)::text')
            il.add_css('loser', '.event-team:nth-child(1)::text')
            il.add_value('url', response.url)
            yield il.load_item()

items.py

import scrapy

from scrapy.loader import ItemLoader
from scrapy.loader.processors import TakeFirst, MapCompose
from operator import methodcaller
from scrapy import Spider, Request, Selector

class SofascoreItem(scrapy.Item):
    loser = scrapy.Field()
    winner = scrapy.Field()
    url = scrapy.Field()



class SofascoreItemLoader(ItemLoader):
    default_item_class = SofascoreItem
    default_input_processor = MapCompose(methodcaller('strip'))
    default_output_processor = TakeFirst()

pipelines.py

import json
import codecs
from collections import OrderedDict

class JsonPipeline(object):

    def __init__(self):
        self.file = codecs.open('data_utf8.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        line = json.dumps(OrderedDict(item), ensure_ascii=False,
                          sort_keys=False) + "\n"
        self.file.write(line)
        return item

    def close_spider(self, spider):
        self.file.close()
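
For the pipeline to run at all, it also has to be enabled in the project settings. A minimal sketch, assuming the project package is named scrapejs (as in the spider's imports) and this class lives in scrapejs/pipelines.py:

# settings.py: register the custom pipeline so Scrapy passes every
# scraped item through JsonPipeline.process_item. The dotted path is
# an assumption based on the scrapejs package name used in the spider.
ITEM_PIPELINES = {
    'scrapejs.pipelines.JsonPipeline': 300,
}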

Recommended answer

The problem here is that you're looping over .event-team elements.
Each of these elements can only be the winner or the loser, so you get a separate item for each.

What you should be doing instead is loop over elements that contain both (.list-event seems like a good candidate) and extract both the winner and the loser from each of those.

This way, you'd have one loop iteration per event and, as a result, one item per event.
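
A minimal sketch of that change, assuming each .list-event row contains both .event-team cells; the class names are guesses based on the answer, not selectors verified against the live page:

    def parse(self, response):
        # One iteration per event row, so the winner, the loser and the url
        # all land in the same item (and therefore on the same JSON line).
        for row in response.css('.list-event'):
            il = SofascoreItemLoader(selector=row)
            il.add_css('winner', '.event-team:nth-child(2)::text')
            il.add_css('loser', '.event-team:nth-child(1)::text')
            il.add_value('url', response.url)
            yield il.load_item()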
