Json Lines(Jsonl)生成器转换为CSV格式 [英] Json lines (Jsonl) generator to csv format

查看:415
本文介绍了Json Lines(Jsonl)生成器转换为CSV格式的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个很大的Jsonl文件(超过6GB),我需要将其转换为.csv格式.运行后:

I have a large Jsonl file (6GB+) which I need to convert to .csv format. After running:

import json
import pandas

with open(root_dir + 'filename.json') as json_file:
  for line in json_file:
    data = json.loads(line)
    print(data)

返回以下格式的许多记录:

Many records of the below format are returned:

{'url': 'https://twitter.com/CHItraders/status/945958273861275648', 'date': '2017-12-27T10:03:22+00:00', 'content': 'Why #crypto currencies like $BTC #Bitcoin are set for global domination - MUST READ! - https :// t.co/C1kEhoLaHr https :// t.co/sZT43PBDrM', 'renderedContent': 'Why #crypto currencies like $BTC #Bitcoin are set for global domination - MUST READ! - BizNews.com biznews.com/wealth-buildin…', 'id': 945958273861275648, 'username': 'CHItraders', 'user': {'username': 'CHItraders', 'displayname': 'CHItraders', 'id': 185663478, 'description': 'Options trader. Market-news. Nothing posted constitutes as advice. Do your own diligence.', 'rawDescription': 'Options trader. Market-news. Nothing posted constitutes as advice. Do your own diligence.', 'descriptionUrls': [], 'verified': False, 'created': '2010-09-01T14:52:28+00:00', 'followersCount': 1196, 'friendsCount': 490, 'statusesCount': 38888, 'favouritesCount': 10316, 'listedCount': 58, 'mediaCount': 539, 'location': 'Chicago, IL', 'protected': False, 'linkUrl': None, 'linkTcourl': None, 'profileImageUrl': 'https://pbs.twimg.com/profile_images/623935252357058560/AaeCRlHB_normal.jpg', 'profileBannerUrl': 'https://pbs.twimg.com/profile_banners/185663478/1437592670'}, 'outlinks': ['http://BizNews.com', 'https://www.biznews.com/wealth-building/2017/12/27/bitcoin-rebecca-mqamelo/#.WkNv2bQ3Awk.twitter'], 'outlinksss': 'http://BizNews.com https://www.biznews.com/wealth-building/2017/12/27/bitcoin-rebecca-mqamelo/#.WkNv2bQ3Awk.twitter', 'tcooutlinks': ['https :// t.co/C1kEhoLaHr', 'https :// t.co/sZT43PBDrM'], 'tcooutlinksss': 'https :// t.co/C1kEhoLaHr https :// t.co/sZT43PBDrM', 'replyCount': 0, 'retweetCount': 0, 'likeCount': 0, 'quoteCount': 0, 'conversationId': 945958273861275648, 'lang': 'en', 'source': '<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>', 'media': None, 'retweetedTweet': None, 'quotedTweet': None, 'mentionedUsers': None}

由于文件大小,我无法使用转换:

Due to the size of the file, I can't use the conversion:

with open(root_dir + 'filename.json', 'r', encoding ='utf-8-sig') as f:
    data = f.readlines()

data = map(lambda x: x.rstrip(), data)
data_json_str = "[" + ','.join(data) + "]"
newdf = pd.read_json(StringIO(data_json_str))
newdf.to_csv(root_dir + 'output.csv')

由于MemoryError.我正在尝试使用下面的生成器,并将每一行写入csv,这应该消除MemoryError问题:

due to MemoryError. I am trying to use the below generator and write each line to the csv, which should negate the MemoryError issue:

def yield_line_delimited_json(path):
    """
    Read a line-delimited json file yielding each row as a record
    :param str path:
    :rtype: list[object]
    """
    with open(path, 'r') as json_file:
        for line in json_file:
            yield json.loads(line)

new = yield_line_delimited_json(root_dir + 'filename.json')

with open(root_dir + 'output.csv', 'w') as f:
    for x in new:
        f.write(str(x))

但是,数据未写入.csv格式.感谢您提供关于为何数据未写入csv文件的任何建议!

However, the data is not written to the .csv format. Any advice on why the data isn't writing to the csv file is greatly appreciated!

编辑:添加了示例JSONl

Edit: Added sample JSONl

推荐答案

生成器似乎完全多余.

with open(root_dir + 'filename.json') as old, open(root_dir + 'output.csv', 'w') as csvfile:
    new = csv.writer(csvfile)
    for x in old:
        row = json.loads(x)
        new.writerow(row)

如果一行JSON不能简单地产生一个字符串和数字数组,您仍然需要弄清楚如何将其从JSON内部的任何结构转换为可以序列化为JSON的一维列表的东西固定长度的字符串和数字.

If one line of JSON does not simply produce an array of strings and numbers, you still need to figure out how to convert it from whatever structure is inside the JSON to something which can usefully be serialized as a one-dimensional list of strings and numbers of a fixed length.

如果可以期望您的JSON可靠地包含一个带有一组固定的关键字-值对的字典,则可以尝试

If your JSON can be expected to reliably contain a single dictionary with a fixed set of keyword-value pairs, maybe try

from csv import DictWriter
import json

with open(jsonfile, 'r') as inp, open(csvfile, 'w') as outp:
    writer = DictWriter(outp, fieldnames=[
            'url', 'date', 'content', 'renderedContent', 'id', 'username',
            # 'user',  # see below
            'outlinks', 'outlinksss', 'tcooutlinks', 'tcooutlinksss',
            'replyCount', 'retweetCount', 'likeCount', 'quoteCount',
            'conversationId', 'lang', 'source', 'media', 'retweetedTweet',
            'quotedTweet', 'mentionedUsers'])
    for line in inp:
        row = json.loads(line)
        writer.writerow(row)

我省略了user字段,因为在您的示例中,此键包含一个嵌套结构,如果不进行进一步处理,就不能轻易将其转换为CSV.也许您只想将user.id提取到新字段user_id中;还是您想将整个user结构提升到主记录中诸如user_usernameuser_displaynameuser_id等的主列的扁平列表中?

I omitted the user field because in your example, this key contains a nested structure which cannot easily be transformed into CSV without further mangling. Perhaps you would like to extract just user.id into a new field user_id; or perhaps you would like to lift the entire user structure into a flattened list of additional columns in the main record like user_username, user_displayname, user_id, etc?

更详细地说,CSV基本上是一个二维矩阵,其中每一行是对应于数据中一条记录的一维列集合,其中每一列可以包含一个字符串或一个数字.每行都必须具有完全相同的列数,尽管您可以将其中一些保留为空.

In some more detail, CSV is basically a two-dimensional matrix where every row is a one-dimensional collection of columns corresponding to one record in the data, where each column can contain one string or one number. Every row needs to have exactly the same number of columns, though you can leave some of them empty.

可以轻松转换为CSV的JSON看起来像

JSON which can trivially be transformed into CSV would look like

["Adolf", "1945", 10000000]
["Joseph", "1956", 25000000]
["Donald", null, 1000000]

可以通过某种转换(您必须单独指定,例如使用上面指定的字典键顺序进行转换)将JSON转换为CSV的

JSON which can be transformed to CSV by some transformation (which you'd have to specify separately, like for example with the dictionary key ordering specified above) might look like

{"name": "Adolf", "dod": "1945", "death toll": 10000000}
{"dod": "1956", "name": "Joseph", "death toll": 25000000}
{"death toll": 1000000, "name": "Donald"}

(只是为了使它更有趣,缺少一个字段,并且字典的顺序从一条记录到另一条记录是不同的.这不是典型的,但绝对在有效的极端情况下,Python可能无法猜测其范围)拥有如何处理的方法.)

(Just to make it more interesting, one field is missing, and the dictionary order varies from one record to the next. This is not typical, but definitely within the realm of valid corner cases that Python could not possibly guess on its own how to handle.)

大多数现实世界中的JSON比这两个简单示例中的任何一个都要复杂得多,以至于我们可以说一般情况下无法解决该问题.

Most real-world JSON is significantly more complex than either of these simple examples, to the point where we can say that the problem is not possible to solve in the general case.

这篇关于Json Lines(Jsonl)生成器转换为CSV格式的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆