Sort huge JSON file using bash or python


Question

Requirement: I have a JSON file in .gz format; compressed, it is around ~500 MB. When I extract it, the JSON file grows to roughly ~10 GB. The extracted file contains individual JSON objects, one per line. I want to sort the file on a field ps using either a bash script or a Python program.

Because the file is too large, it is not advisable to load it into memory. So I used the gzcat and cat bash commands to stream the JSON data and then piped it to jq for sorting. But either the system stops responding during the process, or I get an empty file in output.json:

>cat  sth2.json | parallel --pipe --group --block 1000M --recend '\n}\n' "jq -s -c 'sort_by(.ps) | .[]'"  > "output.json"
>gzcat  sth2.json.gz | parallel --pipe --group --block 1000M --recend '\n}\n' "jq -s -c 'sort_by(.ps) | .[]'"  > "output.json"

Hardware: 16 GB RAM, Core i5 processor

Sample JSON data:

{
    "ps":"abc"
    ....
}
{   
    "ps":"def"
    ......
}
{
    "ps":"abc"
    ....
}

Expected output:

{
    "ps":"abc"
    ....
}
{   
    "ps":"abc"
    ....
}
{
    "ps":"def"
    ....
}

I don't understand what I am doing wrong. Can anyone suggest how to sort such a huge JSON file? Link I followed: https://github.com/joelpurra/jq-hopkok/tree/master/src/parallelism

Also, is there any way I can do this via some kind of map-reduce without Hadoop?

Approach 1: Streaming the data into a local SQLite database.

import sqlite3
import fileinput

PATH = ".../sqlite-snapshot-201904101324/testDB.db"
insert_query = "INSERT INTO feeds (data) VALUES (?)"

def db_connect(db_path=PATH):
    con = sqlite3.connect(db_path)
    return con

con = db_connect()  # connect to the database
cur = con.cursor()  # instantiate a cursor obj

# create the table if it does not already exist (assumed schema: one TEXT column per JSON line)
cur.execute("CREATE TABLE IF NOT EXISTS feeds (data TEXT)")

# stream stdin line by line so the 10 GB file is never held in memory
record_count = 0
for line in fileinput.input():
    cur.execute(insert_query, (line,))
    record_count += 1

con.commit()  # persist the inserted rows
con.close()

Command line:

>gzcat sth.json.gz | python insert.py
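
As a follow-up sketch (my own illustration, not part of the original post): once the rows have been inserted, the sorted output can be streamed back out with an ORDER BY query. This assumes the feeds table used by the insert script exists with a single data column and that the SQLite build includes the JSON1 extension providing json_extract:

import sqlite3

PATH = ".../sqlite-snapshot-201904101324/testDB.db"  # same (elided) path as in the insert script

con = sqlite3.connect(PATH)
with open("output.json", "w") as out:
    # iterating over the cursor fetches rows lazily instead of loading them all at once;
    # json_extract requires an SQLite build with the JSON1 extension enabled
    for (data,) in con.execute(
        "SELECT data FROM feeds ORDER BY json_extract(data, '$.ps')"
    ):
        out.write(data)  # each stored line already ends with a newline
con.close()

SQLite performs the ORDER BY with an external sort that spills to temporary files, so this read-back also avoids holding the full data set in memory.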

Answer

Here is one solution based on the suggestion in one of the comments:

If you can e.g. prefix the lines with the sort key so that they can be sorted as text rather than JSON, then GNU sort can easily sort 10GB+ files without loading them into memory. – that other guy

You can use jq to do this along the following lines:

jq -cr '"\(.ps)\t\(.)"' 

This will produce lines with tab-separated values like so:

abc {"ps":"abc","x":0}
abc {"ps":"abc","x":1}

Using the -c option ensures that each pair (i.e. the sort key and the object) is written on a single line.

Now you can easily sort the lines, e.g. using sort, and then use e.g. cut to strip off the .ps field.
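
For example, a minimal end-to-end pipeline (my own sketch, not from the original answer; it assumes the compressed input is named sth2.json.gz and that the .ps values contain no tabs) could look like this:

>gzcat sth2.json.gz | jq -cr '"\(.ps)\t\(.)"' | sort | cut -f2- > output.json

GNU sort spills to temporary files on disk when its memory buffer fills up, so the 10 GB file never has to fit in RAM; cut splits on tabs by default, so -f2- drops the key column and leaves the original one-line JSON objects.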

Finally, if you really want the output to be formatted, you can run it through jq once more (e.g. jq .); the point is that jq is stream-oriented by default.

The above assumes that the .ps values are tab-free. If that is not the case, you could either use a different field separator, or:

jq -cr '([.ps] | @tsv) + "\t" + tostring'
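
Here @tsv escapes any embedded tab or newline characters in the .ps value (the value is wrapped in a one-element array because @tsv expects an array), so the sort key still occupies exactly one tab-separated field.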
