When extracting my .json.gz file, some characters are added to it, and the file cannot be stored as a JSON file
Problem description
I am trying to unzip some .json.gz files, but gzip adds some characters to them, which makes them unreadable as JSON.
What do you think the problem is, and how can I solve it?
If I use unzipping software such as 7zip to unzip the file, this problem disappears.
This is my code:
import gzip
import json

with gzip.open('filename', 'rb') as f:
    json_content = json.loads(f.read())
This is the error I get:
Exception has occurred: json.decoder.JSONDecodeError
Extra data: line 2 column 1 (char 1585)
I used the following code:
with gzip.open('filename', mode='rb') as f:
    print(f.read())
and realized that the file starts with b' (as shown below):
b'{"id":"tag:search.twitter.com,2005:5667817","objectType":"activity"
I think the b' is what makes the file unworkable for the next stage. Do you have any solution for removing the b'? There are millions of these zipped files, and I cannot do that manually.
I have uploaded a sample of these files at the following link: just a few json.gz files
Recommended answer
The problem isn't the b prefix you're seeing with print(f.read()). That prefix just means the data is a bytes object (a sequence of raw byte values) rather than a str (a regular Python string of Unicode characters), and json.loads() will accept either in Python 3.6 and later. The JSONDecodeError occurs because the data in the gzipped file, taken as a whole, isn't a single valid JSON document, which is what json.loads() requires. The format looks like what is known as JSON Lines, which the Python standard library json module doesn't (directly) support.
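To see both points concretely, here is a small sketch using made-up in-memory data (the two records are hypothetical stand-ins, not taken from the sample files). It suggests that json.loads() parses a bytes value without complaint, while two JSON objects separated by a newline trigger the same kind of "Extra data" error:

```python
import json

# Hypothetical stand-in data: two JSON objects, one per line (JSON Lines).
single = b'{"id": 1, "objectType": "activity"}'
two_lines = single + b'\n' + b'{"id": 2, "objectType": "activity"}\n'

# bytes input is fine (Python 3.6+): the b prefix is not the problem.
parsed = json.loads(single)

# More than one top-level value in the input is what json.loads() rejects.
try:
    json.loads(two_lines)
    error_message = None
except json.JSONDecodeError as exc:
    error_message = exc.msg  # Short description of what went wrong.

print(parsed)
print(error_message)
```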
Dunes' answer to the question that @Charles Duffy at one point marked this one as a duplicate of wouldn't have worked as presented, because of this formatting issue. However, from the sample file you added a link to in your question, it looks like there is a valid JSON object on each line of the file. If that's true of all of your files, then a simple workaround is to process each file line by line.
Here's what I mean:
import json
import gzip

filename = '00_activities.json.gz'  # Sample file.

json_content = []
with gzip.open(filename, 'rb') as gzip_file:
    for line in gzip_file:  # Read one line.
        line = line.rstrip()
        if line:  # Any JSON data on it?
            obj = json.loads(line)
            json_content.append(obj)

print(json.dumps(json_content, indent=4))  # Pretty-print data parsed.
Note that the output it prints, a single JSON array containing every object, shows what a valid JSON version of the file might have looked like.
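Since there are millions of these files, it may be worth not holding every parsed object in memory at once. Below is a minimal sketch of a generator-based variant, assuming every file holds one JSON object per line and is UTF-8 encoded. The name iter_json_lines is a hypothetical helper, and the demo writes a small temporary file with made-up records so the snippet runs on its own:

```python
import gzip
import json
import os
import tempfile


def iter_json_lines(path):
    """Yield one parsed object per non-blank line of a gzipped JSON Lines file."""
    # 'rt' opens the decompressed stream in text mode; UTF-8 is an assumption.
    with gzip.open(path, 'rt', encoding='utf-8') as gzip_file:
        for line in gzip_file:
            line = line.strip()
            if line:  # Skip blank lines.
                yield json.loads(line)


# Demo with a throwaway file holding hypothetical records.
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.json.gz')
with gzip.open(demo_path, 'wt', encoding='utf-8') as out:
    out.write('{"id": 1}\n\n{"id": 2}\n')

records = list(iter_json_lines(demo_path))
print(records)
```

To handle many files, one option is to loop over the paths (for example, those returned by glob.glob('*.json.gz')) and process each object as it is yielded instead of collecting everything into a list.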