Split JSON file in equal/smaller parts with Python


Problem description

I am currently working on a project where I use Sentiment Analysis for Twitter Posts. I am classifying the Tweets with Sentiment140. With the tool I can classify up to 1,000,000 Tweets per day and I have collected around 750,000 Tweets. So that should be fine. The only problem is that I can send a max of 15,000 Tweets to the JSON Bulk Classification at once.

My whole code is set up and running. The only problem is that my JSON file now contains all 750,000 Tweets.

Therefore my question: What is the best way to split the JSON into smaller files with the same structure? I would prefer to do this in Python.

I have thought about iterating through the file. But how do I specify in the code that it should create a new file after, for example, 5,000 elements?

I would love to get some hints on what the most reasonable approach is. Thank you!

This is the code that I have at the moment.

import json
from itertools import izip_longest

def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return izip_longest(fillvalue=fillvalue, *args)

# Open JSON file
values = open('Tweets.json').read()
#print values

# Adjust formatting of JSON file
values = values.replace('\n', '')    # do your cleanup here
#print values

v = values.encode('utf-8')
#print v

# Load JSON file
v = json.loads(v)
print type(v)

for i, group in enumerate(grouper(v, 5000)):
    with open('outputbatch_{}.json'.format(i), 'w') as outputfile:
        json.dump(list(group), outputfile)

The output gives:

["data", null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, ...]

in a file called "outputbatch_0.json".

EDIT 2: This is the structure of the JSON.

{
    "data": [
        {
            "text": "So has @MissJia already discussed this Kelly Rowland Dirty Laundry song? I ain't trying to go all through her timelime...",
            "id": "1"
        },
        {
            "text": "RT @UrbanBelleMag: While everyone waits for Kelly Rowland to name her abusive ex, don't hold your breath. But she does say he's changed: ht\u00e2\u20ac\u00a6",
            "id": "2"
        },
        {
            "text": "@Iknowimbetter naw if its weak which I dont think it will be im not gonna want to buy and up buying Kanye or even Kelly Rowland album lol",
            "id": "3"
        }
    ]
}

Recommended answer

Use an iteration grouper; the itertools module recipes list includes the following:

from itertools import izip_longest

def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return izip_longest(fillvalue=fillvalue, *args)
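
Note that the recipe pads the final group with the fillvalue (None by default) whenever the item count is not an exact multiple of n; that padding is exactly where the null entries in the output shown above come from. A quick illustration:

print(list(grouper('ABCDEFG', 3)))
# [('A', 'B', 'C'), ('D', 'E', 'F'), ('G', None, None)]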

This lets you iterate over your tweets in groups of 5000:

for i, group in enumerate(grouper(input_tweets, 5000)):
    with open('outputbatch_{}.json'.format(i), 'w') as outputfile:
        json.dump(list(group), outputfile)
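
Given the structure shown in EDIT 2, input_tweets is the list stored under the top-level "data" key, not the loaded dict itself. Iterating over the dict yields only its keys, which is why the output above is the single string "data" followed by None padding. Below is a minimal end-to-end sketch putting the pieces together (Python 3, where izip_longest is named zip_longest; wrapping each chunk in {"data": [...]} and stripping the None padding are assumptions made here so that every part keeps the same structure as the input file):

import json
from itertools import zip_longest  # named izip_longest in Python 2

def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)

# Load the file and pull out the list of tweet objects under "data"
with open('Tweets.json') as infile:
    tweets = json.load(infile)['data']

for i, group in enumerate(grouper(tweets, 5000)):
    # The last group is padded with None; strip the padding before writing
    chunk = [tweet for tweet in group if tweet is not None]
    with open('outputbatch_{}.json'.format(i), 'w') as outputfile:
        # Wrap the chunk so each part keeps the {"data": [...]} structure
        json.dump({'data': chunk}, outputfile)

Each outputbatch_N.json can then be submitted to the bulk classifier in turn, staying well under the 15,000-tweet limit per request.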
