How to use GZIP to compress JSON data in a Python program?


Problem description

I have an AWS Kinesis Python program - a producer that sends data to my stream. But my JSON file is 5 MB. I would like to compress the data using GZIP or any other suitable method. My producer code is like this:

import boto3
import random

# Put data to Kinesis: send each line of the JSON file as one record

my_stream_name = 'ApacItTeamTstOrderStream'

kinesis_client = boto3.client('kinesis', region_name='us-east-1')

with open('output.json', 'r') as file:
    for line in file:
        put_response = kinesis_client.put_record(
            StreamName=my_stream_name,
            Data=line,
            PartitionKey=str(random.randrange(3000)))

        print(put_response)

My requirement is:

I need to compress this data and then push the compressed data to Kinesis. When we consume the data afterwards, we need to decompress it...

Since I am very new to this, can someone guide me or suggest what kind of code I should add to the existing program?

Recommended answer

There are two ways in which you can compress the data:

1. Enable GZIP/Snappy compression on the Firehose stream - this can be done via the console itself

Firehose buffers the data and, after the threshold is reached, it takes all the buffered data and compresses it together to create a .gz object. (A boto3 sketch of enabling this setting programmatically is shown after the pros and cons below.)

Pros:

  • Minimal effort required on the producer side - just change a setting in the console.
  • Minimal effort required on the consumer side - Firehose creates .gz objects in S3 and sets metadata on the objects to reflect the compression type, so if you read the data via the AWS SDK itself, the SDK does the decompression for you.

Cons:

  • Since Firehose charges based on the size of the data it ingests, you won't save on Firehose cost. You will save on S3 cost (because the objects are smaller).
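
For reference, the same compression setting can also be enabled programmatically when the delivery stream is created. The sketch below is only an illustration using boto3's Firehose client; the stream name, role ARN, and bucket ARN are hypothetical placeholders, not resources from the question.

import boto3

firehose = boto3.client('firehose', region_name='us-east-1')

# Hypothetical stream/role/bucket names - replace with your own resources
firehose.create_delivery_stream(
    DeliveryStreamName='my-compressed-delivery-stream',
    ExtendedS3DestinationConfiguration={
        'RoleARN': 'arn:aws:iam::123456789012:role/firehose_delivery_role',
        'BucketARN': 'arn:aws:s3:::my-destination-bucket',
        # Firehose applies GZIP before writing the objects to S3
        'CompressionFormat': 'GZIP',
    })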

2. Compression in the producer code - you need to write the code

I implemented this in Java a few days back. We were ingesting over 100 petabytes of data into Firehose (from where it gets written to S3). This was a massive cost for us.

So we decided to do the compression on the producer side. This results in compressed data flowing to KF, which is written as-is to S3. Note that since KF is not doing the compression, it has no idea what the data is; as a result, the objects created in S3 don't carry the ".gz" compression metadata, and consumers are none the wiser as to what data is in the objects. We then wrote a wrapper on top of the AWS Java SDK for S3 which reads the object and decompresses it. (A Python sketch of the same idea, adapted to the question's producer, follows the pros and cons below.)

Pros:

  • We got a compression ratio close to 90%. This directly translated into a 90% saving on Firehose cost, plus the additional S3 savings from approach 1.

Cons:

  • Not exactly a con, but it requires more development effort: writing the wrapper on top of the AWS SDK, testing it, and so on.
  • Compression and decompression are quite CPU-intensive. On average, they increased our CPU usage by about 22%.
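
Applied to the Python producer from the question, producer-side compression could look like the sketch below. It uses Python's standard gzip module; the stream name is taken from the question's code, while the consumer half assumes you have already obtained a shard iterator via get_shard_iterator.

import gzip
import json
import random

import boto3

my_stream_name = 'ApacItTeamTstOrderStream'
kinesis_client = boto3.client('kinesis', region_name='us-east-1')

# Producer: gzip each JSON line before putting it on the stream
with open('output.json', 'r') as file:
    for line in file:
        compressed = gzip.compress(line.encode('utf-8'))
        put_response = kinesis_client.put_record(
            StreamName=my_stream_name,
            Data=compressed,
            PartitionKey=str(random.randrange(3000)))
        print(put_response)

# Consumer: decompress each record before parsing the JSON
# (shard_iterator is assumed to have been obtained via get_shard_iterator)
def read_records(shard_iterator):
    response = kinesis_client.get_records(ShardIterator=shard_iterator, Limit=100)
    for record in response['Records']:
        payload = json.loads(gzip.decompress(record['Data']).decode('utf-8'))
        print(payload)

If the individual lines are small, compressing a whole batch and sending it with put_records will usually compress better, but the per-record version above keeps the change to the existing code minimal.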
