When extracting my .json.gz file, some characters are added to it, and the file cannot be stored as a JSON file


Problem description

I am trying to decompress some .json.gz files, but gzip seems to add some characters to the output, which makes it unreadable as JSON.

What do you think the problem is, and how can I solve it?

If I use decompression software such as 7zip to extract the file, this problem disappears.

Here is my code:

with gzip.open('filename' , 'rb') as f:
    json_content = json.loads(f.read())

This is the error I get:

Exception has occurred: json.decoder.JSONDecodeError
Extra data: line 2 column 1 (char 1585)

I then used the following code:

with gzip.open ('filename', mode='rb') as f:
    print(f.read())

and realized that the file starts with b' (as shown below):

b'{"id":"tag:search.twitter.com,2005:5667817","objectType":"activity"

I think the b' is what makes the file unusable for the next stage. Do you have any solution to remove the b'? There are millions of these compressed files, so I cannot do it manually.

I have uploaded a sample of these files at the following link: just a few json.gz files

Recommended answer

The problem isn't the b prefix you see in the output of print(f.read()). That prefix only means the data is a bytes object (a sequence of raw byte values) rather than a str (a regular Python string of Unicode characters), and json.loads() accepts either; bytes input has been supported since Python 3.6. The JSONDecodeError occurs because the decompressed data as a whole is not a single valid JSON document: "Extra data: line 2 column 1" means the parser finished one complete value and then found more content starting on line 2. The format looks like JSON Lines, which the Python standard library json module doesn't (directly) support.
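To make the diagnosis concrete, here is a minimal sketch that reproduces the same kind of error with a tiny in-memory payload. The payload is made up for illustration and is not taken from the sample files; it only assumes the layout described above, with one JSON object per line.

```python
import json

# Hypothetical payload: two JSON objects separated by a newline,
# i.e. the JSON Lines layout. Note that it is bytes, not str.
data = b'{"id": 1}\n{"id": 2}\n'

# bytes input is accepted, so the b prefix is not the problem:
# a single object parses without any decoding step.
first = json.loads(b'{"id": 1}')

# Parsing the whole payload fails, because a second document
# follows the first one.
try:
    json.loads(data)
    error_message = None
    error_line = None
except json.JSONDecodeError as exc:
    error_message = exc.msg   # short description of the failure
    error_line = exc.lineno   # line on which the extra content starts

print(first, error_message, error_line)
```

The exception's lineno attribute points at the line where the second object begins, which matches the "line 2 column 1" in the traceback from the question.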

At one point @Charles Duffy marked this question as a duplicate of another one; Dunes' answer there would not have worked as presented because of this formatting issue. However, judging from the sample file linked in your question, each line of the file appears to hold one valid JSON object. If that is true of all your files, a simple workaround is to process each file line by line.

Here's what I mean:

import json
import gzip


filename = '00_activities.json.gz'  # Sample file.

json_content = []
with gzip.open(filename, 'rb') as gzip_file:
    for line in gzip_file:  # Read one line.
        line = line.rstrip()
        if line:  # Any JSON data on it?
            obj = json.loads(line)
            json_content.append(obj)

print(json.dumps(json_content, indent=4))  # Pretty-print data parsed.    

Note that the output it prints shows what valid JSON might have looked like: all the objects wrapped in a single top-level array and separated by commas.
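Since the question mentions millions of files, it may be worth avoiding a full in-memory list per file. The following is a sketch of a generator variant of the same line-by-line idea. The function name iter_json_lines and the throwaway sample file are made up for illustration, and the UTF-8 encoding is an assumption about the data.

```python
import gzip
import json
import os
import tempfile


def iter_json_lines(path):
    """Yield one parsed object per non-empty line of a gzipped file."""
    # Mode 'rt' makes gzip decode the bytes to str while reading;
    # UTF-8 is assumed here, adjust if the files use another encoding.
    with gzip.open(path, 'rt', encoding='utf-8') as gzip_file:
        for line in gzip_file:
            line = line.strip()
            if line:  # Skip blank lines.
                yield json.loads(line)


# Self-check on a small temporary file written in the same layout.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, 'sample.json.gz')
    with gzip.open(sample_path, 'wt', encoding='utf-8') as out:
        out.write('{"id": 1}\n\n{"id": 2}\n')
    parsed = list(iter_json_lines(sample_path))

print(parsed)
```

Because the function yields objects one at a time, a caller can loop over it directly and handle each record without holding the whole file's contents in memory.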
