Splitting a file into smaller files by number of fields
Question
I'm having a hard time breaking a large (50GB) CSV file into smaller parts. Each line has a few thousand fields. Some of the fields are strings in double quotes; others are integers, decimals and booleans.
I want to parse the file line by line and split it by the number of fields in each row. The strings may contain several commas, and there are also a number of empty fields, as in this sample row:
,,1,30,50,"Sold by father,son and daughter for $4,000" , ,,,, 12,,,20.9,0,
I tried using
perl -pe' s{("[^"]+")}{($x=$1)=~tr/,/|/;$x}ge ' file >> file2
to change the commas inside the quotes to |, but that didn't work. I plan to use
awk -F"|" conditional statement appending to new k_fld_files file2
Is there an easier way to do this? I'm looking at Python, but I probably need a utility that will stream-process the file, line by line.
Recommended answer
Using Python: if you just want to parse CSV that includes embedded delimiters and stream it out with a new delimiter, then something such as:
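import csv
import sys

# newline='' lets the csv module handle line endings inside quoted fields (Python 3)
with open('filename.csv', newline='') as fin:
    csvout = csv.writer(sys.stdout, delimiter='|')
    # Stream row by row; quoted fields with embedded commas stay intact
    for row in csv.reader(fin):
        csvout.writerow(row)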
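As a quick illustration of why the csv module is used here instead of splitting on commas, the sketch below compares the two on a single row. The sample string is a shortened stand-in for the row in the question, chosen for illustration only.

```python
import csv
import io

# Shortened sample row modelled on the question (an assumption for illustration).
line = '1,30,"Sold by father,son and daughter for $4,000",,20.9\n'

# Naive split: also breaks on the commas inside the quoted string.
naive_fields = line.rstrip('\n').split(',')

# csv.reader: treats the quoted string as a single field.
parsed_fields = next(csv.reader(io.StringIO(line)))

print(len(naive_fields), len(parsed_fields))
print(parsed_fields[2])
```

The field count from csv.reader is the one to rely on when grouping rows by number of fields.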
Otherwise, it's not much more difficult to make this do all kinds of things.
Example of outputting to files per column (untested):
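import csv

cols_to_output = {}  # column number -> open output file
with open('filename.csv', newline='') as fin:
    for row in csv.reader(fin):
        for colno, col in enumerate(row):
            # Open each column's file once, on first use (Python 3 text mode)
            if colno not in cols_to_output:
                cols_to_output[colno] = open('column_output.{}'.format(colno), 'w', newline='')
            # Write this column's value as a one-field CSV record
            csv.writer(cols_to_output[colno]).writerow([col])
for fout in cols_to_output.values():
    fout.close()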
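The same pattern can be adapted to what the question asks for, namely routing each row to a file chosen by its number of fields. The following is a minimal sketch, not tested against the 50GB file. The function name `split_by_field_count` and the output naming scheme `fields_{}.csv` are hypothetical choices for illustration.

```python
import csv

def split_by_field_count(in_path, out_pattern='fields_{}.csv'):
    """Stream in_path row by row and write each row to a file chosen by its field count.

    out_pattern is a hypothetical naming scheme; '{}' is replaced by the field count.
    Returns a dict mapping field count -> number of rows written.
    """
    files = {}    # field count -> open output file
    writers = {}  # field count -> csv.writer for that file
    counts = {}   # field count -> rows written so far
    try:
        with open(in_path, newline='') as fin:
            for row in csv.reader(fin):
                n = len(row)
                # Open the output for this field count on first use.
                if n not in writers:
                    files[n] = open(out_pattern.format(n), 'w', newline='')
                    writers[n] = csv.writer(files[n])
                writers[n].writerow(row)
                counts[n] = counts.get(n, 0) + 1
    finally:
        # Close every output file, even if parsing fails part-way.
        for fout in files.values():
            fout.close()
    return counts
```

Calling `split_by_field_count('filename.csv')` reads one row at a time and keeps one output file open per distinct field count, so the whole input never has to fit in memory.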