Splitting a file into smaller files by number of fields


Question

I'm having a hard time breaking a large (50 GB) CSV file into smaller parts. Each line has a few thousand fields. Some of the fields are strings in double quotes; the others are integers, decimals and booleans.

I want to parse the file line by line and split it by the number of fields in each row. The quoted strings may contain several commas, and there are also a number of empty fields:

,,1,30,50,"Sold by father,son and daughter for $4,000" , ,,,, 12,,,20.9,0,

I tried using

perl -pe'  s{("[^"]+")}{($x=$1)=~tr/,/|/;$x}ge  '  file >> file2

to change the commas inside the quotes to |, but that didn't work. I plan to use

awk -F"|" conditional statement appending to new k_fld_files file2

Is there an easier way to do this? I'm looking at Python, but I probably need a utility that will stream-process the file, line by line.

Recommended answer

Using Python: if you just want to parse CSV that includes embedded delimiters and stream it back out with a new delimiter, then something like this:

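import csv
import sys

# newline='' lets the csv module handle line endings itself
with open('filename.csv', newline='') as fin:
    # Re-emit every row on stdout with '|' as the delimiter
    csvout = csv.writer(sys.stdout, delimiter='|')
    for row in csv.reader(fin):
        csvout.writerow(row)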
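
To illustrate why a CSV parser is needed here rather than a plain split on commas, the sample line from the question can be run through both. This is only a minimal sketch: the helper name `count_fields` is made up for illustration, and it assumes the default `csv` dialect (comma delimiter, double-quote quoting) matches the real file.

```python
import csv
import io

# Sample line from the question: one quoted field that contains two commas.
SAMPLE = ',,1,30,50,"Sold by father,son and daughter for $4,000" , ,,,, 12,,,20.9,0,\n'


def count_fields(line):
    """Return the number of CSV fields in a single line of text (hypothetical helper)."""
    row = next(csv.reader(io.StringIO(line)))
    return len(row)


# A plain split also counts the commas inside the quoted string.
naive_count = len(SAMPLE.rstrip('\n').split(','))
# csv.reader keeps the quoted text together as one field.
csv_count = count_fields(SAMPLE)

if __name__ == '__main__':
    print('naive split:', naive_count)
    print('csv.reader :', csv_count)
```

The two counts differ by exactly the number of commas inside the quoted string, which is the discrepancy that any split by field count has to avoid.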

Beyond that, it's not much more difficult to extend this to do other kinds of processing.

Example of outputting to one file per column (untested):

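import csv

cols_to_output = {}  # column index -> open file handle
with open('filename.csv', newline='') as fin:
    for row in csv.reader(fin):
        for colno, col in enumerate(row):
            # Open each column's output file only the first time that column is seen
            if colno not in cols_to_output:
                cols_to_output[colno] = open('column_output.{}'.format(colno), 'w', newline='')
            # Write this column's value as a one-field row
            csv.writer(cols_to_output[colno]).writerow([col])

for fh in cols_to_output.values():
    fh.close()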
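
For the split the question actually asks about (grouping rows by how many fields they have), the same streaming approach might look roughly like the sketch below. The function name `split_by_field_count` and the output naming pattern `fields_{}.csv` are assumptions made for illustration, not something from the original answer. It also assumes the number of distinct field counts is small enough that keeping one output file open per count is not a problem.

```python
import csv
from contextlib import ExitStack


def split_by_field_count(in_path, out_pattern='fields_{}.csv'):
    """Stream in_path and write each row to a file named after its field count.

    Returns a dict mapping field count -> number of rows written.
    Both the function name and out_pattern are hypothetical.
    """
    counts = {}
    writers = {}  # field count -> csv.writer for that output file
    with ExitStack() as stack:
        fin = stack.enter_context(open(in_path, newline=''))
        for row in csv.reader(fin):
            n = len(row)
            if n not in writers:
                # Open the output file for this field count on first use;
                # ExitStack closes every opened file when the block exits.
                fout = stack.enter_context(
                    open(out_pattern.format(n), 'w', newline=''))
                writers[n] = csv.writer(fout)
            writers[n].writerow(row)
            counts[n] = counts.get(n, 0) + 1
    return counts
```

Because `csv.reader` yields one row at a time, the 50 GB input is never loaded into memory as a whole. Fields that contain commas are re-quoted on output by `csv.writer`, so the resulting files remain valid CSV.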
