scrapy json output all items on one line


Problem description

I'm trying to get my output to look like the following in JSON format.

{"loser": "De Schepper K." ,"winner": "Herbert P.", "url":
"https://www.sofascore.com/tennis/2018-02-07"}

But I'm currently getting individual lines for each loser item and winner item. I would like both the winner and the loser to be on the same line as the url.

{"loser": "De Schepper K.", "url": 
"https://www.sofascore.com/tennis/2018-02-07"}
{"winner": "Herbert P.", "url": 
"https://www.sofascore.com/tennis/2018-02-07"}
{"loser": "Sugita Y.", "url": 
 "https://www.sofascore.com/tennis/2018-02-07"}

I'm not sure whether it's my selectors that are causing this behaviour, but I'd like to know how I can customise the pipeline so the loser, winner and date all end up on the same JSON line.

I've never exported to JSON format before, so this is new to me. How do you specify which JSON keys and values go on each line using a custom pipeline?

I also tried to do this with the CSV item exporter and got strange behaviour there too; see Scrapy output is showing empty rows per column.

Here's my spider.py

import scrapy
from scrapy_splash import SplashRequest
from scrapejs.items import SofascoreItemLoader
from scrapy import Spider

import json
from scrapy.http import Request, FormRequest

class MySpider(scrapy.Spider):
    name = "jsscraper"

    start_urls = ["https://www.sofascore.com/tennis/2018-02-07"]

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url=url,
                                callback=self.parse,
                                endpoint='render.html',
                                args={'wait': 1.5})

    def parse(self, response):
        for row in response.css('.event-team'):
            il = SofascoreItemLoader(selector=row)
            il.add_css('winner', '.event-team:nth-child(2)::text')
            il.add_css('loser', '.event-team:nth-child(1)::text')
            il.add_value('url', response.url)
            yield il.load_item()

items.py

import scrapy

from scrapy.loader import ItemLoader
from scrapy.loader.processors import TakeFirst, MapCompose
from operator import methodcaller
from scrapy import Spider, Request, Selector

class SofascoreItem(scrapy.Item):
    loser = scrapy.Field()
    winner = scrapy.Field()
    url = scrapy.Field()



class SofascoreItemLoader(ItemLoader):
    default_item_class = SofascoreItem
    default_input_processor = MapCompose(methodcaller('strip'))
    default_output_processor = TakeFirst()

pipelines.py

import json
import codecs
from collections import OrderedDict

class JsonPipeline(object):

    def __init__(self):
        self.file = codecs.open('data_utf8.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        line = json.dumps(OrderedDict(item), ensure_ascii=False,
                          sort_keys=False) + "\n"
        self.file.write(line)
        return item

    def close_spider(self, spider):
        self.file.close()
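
For the pipeline to run at all, it also has to be enabled in the project settings. A minimal sketch, assuming the project package is named scrapejs (as in the spider's imports) and this class lives in scrapejs/pipelines.py:

# settings.py: register the custom pipeline so Scrapy passes every
# scraped item through JsonPipeline.process_item. The dotted path is
# an assumption based on the scrapejs package name used in the spider.
ITEM_PIPELINES = {
    'scrapejs.pipelines.JsonPipeline': 300,
}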

Recommended answer

The problem here is that you're looping over .event-team elements.
Each of these elements can only be the winner or the loser, so you get a separate item for each.

What you should be doing instead is loop over elements that contain both (.list-event seems like a good candidate) and extract both the winner and the loser from each of those.

This way, you'd have one loop iteration per event and, as a result, one item per event.
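
A minimal sketch of that change, assuming each .list-event row contains both .event-team cells; the class names are guesses based on the answer, not selectors verified against the live page:

    def parse(self, response):
        # One iteration per event row, so the winner, the loser and the url
        # all land in the same item (and therefore on the same JSON line).
        for row in response.css('.list-event'):
            il = SofascoreItemLoader(selector=row)
            il.add_css('winner', '.event-team:nth-child(2)::text')
            il.add_css('loser', '.event-team:nth-child(1)::text')
            il.add_value('url', response.url)
            yield il.load_item()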
