Splitting a CSV file into equal parts?


Problem description


I have a large CSV file that I would like to split into a number that is equal to the number of CPU cores in the system. I want to then use multiprocess to have all the cores work on the file together. However, I am having trouble even splitting the file into parts. I've looked all over google and I found some sample code that appears to do what I want. Here is what I have so far:

import csv
import multiprocessing
import os
import tempfile

def split(infilename, num_cpus=multiprocessing.cpu_count()):
    READ_BUFFER = 2**13
    total_file_size = os.path.getsize(infilename)
    print total_file_size
    files = list()
    with open(infilename, 'rb') as infile:
        for i in xrange(num_cpus):
            files.append(tempfile.TemporaryFile())
            this_file_size = 0
            while this_file_size < 1.0 * total_file_size / num_cpus:
                files[-1].write(infile.read(READ_BUFFER))
                this_file_size += READ_BUFFER
        files[-1].write(infile.readline()) # get the possible remainder
        files[-1].seek(0, 0)
    return files

files = split("sample_simple.csv")
print len(files)

for ifile in files:
    reader = csv.reader(ifile)
    for row in reader:
        print row


The two prints show the correct file size and that it was split into 4 pieces (my system has 4 CPU cores).


However, the last section of the code that prints all the rows in each of the pieces gives the error:

for row in reader:
_csv.Error: line contains NULL byte


I tried printing the rows without running the split function and it prints all the values correctly. I suspect the split function has added some NULL bytes to the resulting 4 file pieces but I'm not sure why.


Does anyone know if this is a correct and fast method to split the file? I just want the resulting pieces to be readable by csv.reader.

Answer


As I said in a comment, csv files would need to be split on row (or line) boundaries. Your code doesn't do this and potentially breaks them up somewhere in the middle of one — which I suspect is the cause of your _csv.Error.
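To see why cutting at an arbitrary byte offset corrupts the data, here is a minimal, self-contained sketch (not part of either code sample in this post) that splits a small in-memory CSV in half by byte count and then parses the second piece:

```python
import csv
import io

data = "name,score\nalice,10\nbob,20\ncarol,30\n"
cut = len(data) // 2  # an arbitrary byte offset, ignoring line boundaries

first, second = data[:cut], data[cut:]
# The cut lands in the middle of the "alice,10" row, so the second
# chunk begins with a row fragment rather than a complete record.
rows = list(csv.reader(io.StringIO(second)))
print(rows[0])  # a fragment of a row, not a full record
```

A fragment like this is merely wrong data; when the chunk boundary instead lands inside a quoted field or, as in the question's buffer-based copy, produces stray bytes, csv.reader raises an error outright.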


The following avoids doing that by processing the input file as a series of lines. I've tested it, and it seems to work standalone in the sense that it divided the sample file up into approximately equal-size chunks; approximately, because it's unlikely that a whole number of rows will fit exactly into a chunk.

Update


This is a substantially faster version of the code than I originally posted. The improvement is that it now uses the temp file's own tell() method to determine the constantly changing length of the file as it's being written, instead of calling os.path.getsize(), which eliminates the need to flush() the file and call os.fsync() on it after each row is written.

import csv
import multiprocessing
import os
import tempfile

def split(infilename, num_chunks=multiprocessing.cpu_count()):
    READ_BUFFER = 2**13
    in_file_size = os.path.getsize(infilename)
    print 'in_file_size:', in_file_size
    chunk_size = in_file_size // num_chunks
    print 'target chunk_size:', chunk_size
    files = []
    with open(infilename, 'rb', READ_BUFFER) as infile:
        for _ in xrange(num_chunks):
            temp_file = tempfile.TemporaryFile()
            while temp_file.tell() < chunk_size:
                try:
                    temp_file.write(infile.next())
                except StopIteration:  # end of infile
                    break
            temp_file.seek(0)  # rewind
            files.append(temp_file)
    return files

files = split("sample_simple.csv", num_chunks=4)
print 'number of files created: {}'.format(len(files))

for i, ifile in enumerate(files, start=1):
    print 'size of temp file {}: {}'.format(i, os.path.getsize(ifile.name))
    print 'contents of file {}:'.format(i)
    reader = csv.reader(ifile)
    for row in reader:
        print row
    print ''

