Opening a large JSON file in Python with no newlines for CSV conversion (Python 2.6.6)


Problem Description

I am attempting to convert a very large JSON file to CSV. I have been able to convert a small file of this type to a 10-record (for example) CSV file. However, when trying to convert a large file (on the order of 50,000 rows in the resulting CSV) it does not work. The data was created by a curl command with -o pointing to the JSON file to be created, and the output file has no newline characters in it. The CSV file will be written with csv.DictWriter() and (where data is the parsed JSON input) has the form

rowcount = len(data['MainKey'])
colcount = len(data['MainKey'][0]['Fields'])

I then loop through the range of the rows and columns to get the csv dictionary entries

csvkey = data['MainKey'][recno]['Fields'][colno]['name']
csvval = data['MainKey'][recno]['Fields'][colno]['Values']['value']
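
For the small files this whole-file approach works; the sketch below is a minimal version of it, assuming the structure and key names shown in the snippets above (file names are placeholders):

import json
import csv

# Whole-file version that works for a small input, as described above.
# The key names ('MainKey', 'Fields', 'name', 'Values', 'value') follow the
# snippets in this question and are illustrative.
infile = open('smallfile.json', 'r')
data = json.load(infile)
infile.close()

rowcount = len(data['MainKey'])
colcount = len(data['MainKey'][0]['Fields'])

fields = [data['MainKey'][0]['Fields'][colno]['name'] for colno in range(colcount)]
ofile = open('out.csv', 'wb')
writer = csv.DictWriter(ofile, fieldnames=fields)
writer.writerow(dict((f, f) for f in fields))  # header row; Python 2.6 has no writeheader()

for recno in range(rowcount):
    row = {}
    for colno in range(colcount):
        csvkey = data['MainKey'][recno]['Fields'][colno]['name']
        csvval = data['MainKey'][recno]['Fields'][colno]['Values']['value']
        row[csvkey] = csvval
    writer.writerow(row)
ofile.close()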

I attempted to use the answers from other questions, but they did not work with a big file (du -m bigfile.json = 157) and the files that I want to handle are even larger.

An attempt to get the size of each line shows

myfile = open('file.json', 'r')
line = myfile.readline()
print len(line)

This reads the entire file as a single string. Thus, one small file shows a length of 67744, while a larger file shows 163815116.

An attempt to read the data directly with

data = json.load(infile)

gives the error that other questions have discussed for large files.

An attempt to use the chunked generator

def json_parse(self, fileobj, decoder=JSONDecoder(), buffersize=2048):
    ...
    yield results

as shown in another answer, works with a 72 KB file (10 rows, 22 columns), but it seems to either lock up or take an interminable amount of time for an intermediate-sized file of 157 MB (from du -m bigfile.json).
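
The json_parse generator referenced above is not reproduced here. The usual shape of that technique is to accumulate fixed-size chunks in a buffer and repeatedly call JSONDecoder.raw_decode() on it, yielding each complete object; the following is a sketch under that assumption, not the exact code from the other answer:

from json import JSONDecoder

def json_parse(fileobj, decoder=JSONDecoder(), buffersize=2048):
    # Accumulate chunks in a buffer and decode complete JSON values
    # from the front of the buffer as soon as they become available.
    buffer = ''
    for chunk in iter(lambda: fileobj.read(buffersize), ''):
        buffer += chunk
        while buffer:
            try:
                result, index = decoder.raw_decode(buffer)
                yield result
                buffer = buffer[index:].lstrip()
            except ValueError:
                # Incomplete JSON in the buffer; read another chunk.
                break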

Note that a debug print shows that each chunk is 2048 bytes, as specified by the default input argument. It appears that it is trying to go through the entire 163815116 bytes (the len shown above) in 2048-byte chunks. If I change the chunk size to 32768, simple math shows that it would take about 5,000 cycles through the loop to process the file.

A change to a chunk size of 524288 exits the loop approximately every 11 chunks but should still take approximately 312 chunks to process the entire file
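
The chunk counts quoted above follow directly from the overall size; a quick sanity check, using the 163815116 length reported by len earlier:

total = 163815116       # whole file read as one string (len shown above)
print total / 32768     # -> 4999, roughly the 5,000 loop cycles mentioned
print total / 524288    # -> 312 chunks of 512 KB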

If I can get it to stop at the end of each row item, I would be able to process that row and send it to the csv file based on the form shown below.

vi on the small file shows that it is of the form

{"MainKey":[{"Fields":[{"Value": {'value':val}, 'name':'valname'}, {'Value': {'value':val}, 'name':'valname'}}], (other keys)},{'Fields' ... }] (other keys on MainKey level) }

I cannot use ijson, as I must set this up on systems for which I cannot import additional software.

Solution

I wound up using a chunk size of 8388608 (0x800000 hex) in order to process the files. I then processed the lines that had been read in as part of the loop, keeping count of rows processed and rows discarded. At each process function, I added the number to the totals so that I could keep track of total records processed.

This appears to be the way that it needs to go.

Next time a question like this is asked, please emphasize that a large chunk size must be specified and not the 2048 as shown in the original answer.

The loop goes

first = True
for data in self.json_parse(inf):
  records = len(data['MainKey'])
  columns = len(data['MainKey'][0]['Fields'])
  if first:
    # Initialize output as DictWriter
    ofile, outf, fields = self.init_csv(csvname, data, records, columns)
    first = False
  reccount, errcount = self.parse_records(outf, data, fields, records)
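
The init_csv helper is not shown in the answer. A minimal sketch of what it might do, assuming it opens csvname, derives the column names from the first record, and returns the open file, the DictWriter, and the field list (the body below is a hypothetical fill for the missing helper):

import csv

def init_csv(self, csvname, data, records, columns):
    # Hypothetical body for the helper used in the loop above.
    # records is accepted only to match the call signature above.
    fields = [data['MainKey'][0]['Fields'][col]['name'] for col in range(columns)]
    ofile = open(csvname, 'wb')
    outf = csv.DictWriter(ofile, fieldnames=fields)
    outf.writerow(dict((f, f) for f in fields))  # header row; no writeheader() in 2.6
    return ofile, outf, fields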

Within the parsing routine

for rec in range(records):
  currec = data['MainKey'][rec]
  # In case the column count differs per record
  columns = len(currec['Fields'])
  retval, valrec = self.build_csv_row(currec, columns, fields)

To parse the columns use

for col in range(columns):
  dataname = currec['Fields'][col]['name']
  dataval = currec['Fields'][col]['Values']['value']
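
The build_csv_row helper is likewise not shown; a hypothetical body consistent with the retval, valrec return seen earlier and with the column loop above might look like this:

def build_csv_row(self, currec, columns, fields):
    # Hypothetical body: collect name/value pairs for one record into a dict
    # keyed by column name, flagging any names that are not known CSV columns.
    row = {}
    ok = True
    for col in range(columns):
        dataname = currec['Fields'][col]['name']
        dataval = currec['Fields'][col]['Values']['value']
        if dataname in fields:
            row[dataname] = dataval
        else:
            ok = False
    return ok, row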

Thus the references now work and the processing is handled correctly. The large chunk apparently allows the processing to be fast enough to handle the data while being small enough not to overload the system.
