如何根据python的常见记录将大型csv文件分解成小文件 [英] How can I break down a large csv file into small files based on common records by python

查看:140
本文介绍了如何根据python的常见记录将大型csv文件分解成小文件的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我想做什么:

我想要做的是,我有一个大的.csv文件。我想将这个大csv文件分解成许多小文件,基于BB列中的公共记录,alos在HH列中包含1,以及所有在HH列中包含0的不常见记录。

What I want to do is that I have a big .csv file. I want to break down this big csv file into many small files based on the common records in BB column that alos contain 1 in the HH column, and all uncommon records that contain 0 in HH column.

因此,所有文件将包含BB列中的常用记录,其中HH列中包含1 ,以及在BB列中没有记录且在HH列中包含0的所有非常见记录。文件名应基于第2列(BB)的公共记录。请看下面的场景。
任何建议想法都非常受欢迎。

As a result, all files will contain common records in BB column that contain 1 in the HH column, and all uncommon records that has no records in BB column and contain 0 in the HH column. The file name should be based on the common record of column 2 (BB). Please take a look below for the scenarion. Any suggestion idea is appreciated highly.

bigFile.csv:

bigFile.csv :

   AA      BB      CC       DD     EE      FF    GG      HH
   12      53     115       10     3       3     186     1
   12      53     01e       23     3       2             1
   12      53     0ce       65     1       3             1
   12      53     173       73     4       2             1
   12      59     115       0      3       3     186     1
   12      59     125       0      3       3     186     1
   12      61     01e       23     3       2             1
   12      61     b6f       0      1       1             1
   12      61     b1b       0      6       5     960     1
   12             68b       95     3       5     334     0
   12             31a       31     2       2             0
   12             221       0      4       5             0
   12             12b       25     5       4     215     0
   12             a10       36     5       1             0      

我的预期结果文件如下:

My expected results files woud be as follows:

53.csv:

   AA      BB      CC       DD     EE      FF    GG      HH
   12      53     115       10     3       3     186     1
   12      53     01e       23     3       2             1
   12      53     0ce       65     1       3             1
   12      53     173       73     4       2             1
   12             68b       95     3       5     334     0
   12             31a       31     2       2             0
   12             221       0      4       5             0
   12             12b       25     5       4     215     0
   12             a10       36     5       1             0      

59.csv:

   AA      BB      CC       DD     EE      FF    GG      HH
   12      59     115       0      3       3     186     1
   12      59     125       0      3       3     186     1
   12             68b       95     3       5     334     0
   12             31a       31     2       2             0
   12             221       0      4       5             0
   12             12b       25     5       4     215     0
   12             a10       36     5       1             0      

61.csv:

   AA      BB      CC       DD     EE      FF    GG      HH
   12      61     01e       23     3       2             1
   12      61     b6f       0      1       1             1
   12      61     b1b       0      6       5    960      1
   12             68b       95     3       5    334      0
   12             31a       31     2       2             0
   12             221       0      4       5             0
   12             12b       25     5       4    215      0
   12             a10       36     5       1             0      


推荐答案

对于您提供的数据,以下脚本将生成您请求的输出文件。它将对文件夹中的所有CSV文件执行此操作:

For the data you have provided, the following script will produce your requested output files. It will perform this operation on ALL CSV files found in the folder:

from itertools import groupby
import glob
import csv
import os

def remove_unwanted(rows):
    return [['' if col == 'NULL' else col for col in row[2:]] for row in rows]

output_folder = 'temp'  # make sure this folder exists

# Search for ALL CSV files in the current folder
for csv_filename in glob.glob('*.csv'):
    with open(csv_filename) as f_input:
        basename = os.path.splitext(os.path.basename(csv_filename))[0]      # e.g. bigfile

        csv_input = csv.reader(f_input)
        header = next(csv_input)
        # Create a list of entries with '0' in last column
        id_list = remove_unwanted(row for row in csv_input if row[7] == '0')
        f_input.seek(0)     # Go back to the start
        header = remove_unwanted([next(csv_input)])

        for k, g in groupby(csv_input, key=lambda x: x[1]):
            if k == '':
                break

            # Format an output file name in the form 'bigfile_53.csv'
            file_name = os.path.join(output_folder, '{}_{}.csv'.format(basename, k))

            with open(file_name, 'wb') as f_output:
                csv_output = csv.writer(f_output)
                csv_output.writerows(header)
                csv_output.writerows(remove_unwanted(g))
                csv_output.writerows(id_list)

这将导致文件 bigfile_53.csv bigfile_59.csv bigfile_61.csv 输出文件夹 temp 。例如 bigfile_53.csv 将显示如下:

This will result in the files bigfile_53.csv, bigfile_59.csv and bigfile_61.csv being created in an output folder called temp. For example bigfile_53.csv will appear as follows:

包含字符串NULL的条目将转换为空字符串,前两列将被删除(根据OP的注释)。

Entries containing the string 'NULL' will be converted to an empty string, and the first two columns will be removed (as per OP's comment).

在Python 2.7.9中测试

Tested in Python 2.7.9

这篇关于如何根据python的常见记录将大型csv文件分解成小文件的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆