Python: splitting a CSV file into equal parts


Problem description




I have a large CSV file that I would like to split into a number of pieces equal to the number of CPU cores in the system. I then want to use multiprocessing to have all the cores work on the file together. However, I am having trouble even splitting the file into parts. I've looked all over Google and found some sample code that appears to do what I want. Here is what I have so far:

def split(infilename, num_cpus=multiprocessing.cpu_count()):
    READ_BUFFER = 2**13
    total_file_size = os.path.getsize(infilename)
    print total_file_size
    files = list()
    with open(infilename, 'rb') as infile:
        for i in xrange(num_cpus):
            files.append(tempfile.TemporaryFile())
            this_file_size = 0
            while this_file_size < 1.0 * total_file_size / num_cpus:
                files[-1].write(infile.read(READ_BUFFER))
                this_file_size += READ_BUFFER
            files[-1].write(infile.readline()) # get the possible remainder
            files[-1].seek(0, 0)
    return files

files = split("sample_simple.csv")
print len(files)

for ifile in files:
    reader = csv.reader(ifile)
    for row in reader:
        print row

The two prints show the correct file size and that it was split into 4 pieces (my system has 4 CPU cores).

However, the last section of the code that prints all the rows in each of the pieces gives the error:

for row in reader:
_csv.Error: line contains NULL byte

I tried printing the rows without running the split function and it prints all the values correctly. I suspect the split function has added some NULL bytes to the resulting 4 file pieces but I'm not sure why.

Does anyone know if this a correct and fast method to split the file? I just want resulting pieces that can be read successfully by csv.reader.

Solution

As I said in a comment, csv files need to be split on row (or line) boundaries. Your code doesn't do this and potentially breaks rows up somewhere in the middle, which I suspect is the cause of your _csv.Error.
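The point above can be demonstrated with a short, self-contained example (Python 3 syntax; the tiny CSV and the cut offset are made up for illustration). Cutting at an arbitrary byte offset silently truncates a row, while cutting on a line boundary keeps every row intact. This doesn't reproduce the exact NULL-byte symptom, which also depends on how the temp files were written and rewound, but it is the same kind of row-integrity failure:

```python
import csv
import io

data = "name,score\nalice,10\nbob,20\ncarol,30\n"

# Cut at an arbitrary byte offset that lands in the middle of a row.
cut = 25  # falls inside the "bob,20" line
chunk1 = data[:cut]

rows = list(csv.reader(io.StringIO(chunk1)))
print(rows[-1])  # ['bob', '2'] -- the last row has been silently truncated

# Cutting on a line boundary instead keeps every row intact.
lines = data.splitlines(keepends=True)
safe_chunk = "".join(lines[:len(lines) // 2])
print(list(csv.reader(io.StringIO(safe_chunk))))  # [['name', 'score'], ['alice', '10']]
```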

The following avoids that by processing the input file as a series of lines. I've tested it, and it works standalone in the sense that it divides the sample file into approximately equal-size chunks; the sizes can only be approximate, because it's unlikely that a whole number of rows will fit exactly into a chunk.

Update

This is a substantially faster version of the code than I originally posted. The improvement comes from using the temp file's own tell() method to track the file's constantly changing length as it's being written, instead of calling os.path.getsize(), which eliminates the need to flush() the file and call os.fsync() on it after each row is written.
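The difference described above can be seen directly: tell() reports the write position immediately, while os.path.getsize() only sees bytes that have actually reached the file on disk. A minimal sketch (Python 3 syntax; the 100-byte payload is arbitrary):

```python
import os
import tempfile

tmp = tempfile.NamedTemporaryFile(delete=False)
tmp.write(b'x' * 100)

pos = tmp.tell()                         # 100: position tracks buffered writes immediately
size_before = os.path.getsize(tmp.name)  # often still 0: data may sit in the write buffer

tmp.flush()                              # what the slower version had to do after every row
size_after = os.path.getsize(tmp.name)   # 100: the bytes have now reached the file

print(pos, size_before, size_after)
tmp.close()
os.unlink(tmp.name)
```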

import csv
import multiprocessing
import os
import tempfile

def split(infilename, num_chunks=multiprocessing.cpu_count()):
    READ_BUFFER = 2**13
    in_file_size = os.path.getsize(infilename)
    print 'in_file_size:', in_file_size
    chunk_size = in_file_size // num_chunks
    print 'target chunk_size:', chunk_size
    files = []
    with open(infilename, 'rb', READ_BUFFER) as infile:
        for _ in xrange(num_chunks):
            temp_file = tempfile.TemporaryFile()
            while temp_file.tell() < chunk_size:
                try:
                    temp_file.write(infile.next())
                except StopIteration:  # end of infile
                    break
            temp_file.seek(0)  # rewind
            files.append(temp_file)
    return files

files = split("sample_simple.csv", num_chunks=4)
print 'number of files created: {}'.format(len(files))

for i, ifile in enumerate(files, start=1):
    print 'size of temp file {}: {}'.format(i, os.path.getsize(ifile.name))
    print 'contents of file {}:'.format(i)
    reader = csv.reader(ifile)
    for row in reader:
        print row
    print ''
