How do you split a csv file into evenly sized chunks in Python?

Question

In a basic setup, I had the following process.

import csv
reader = csv.reader(open('huge_file.csv', 'rb'))

for line in reader:
    process_line(line)

See this related question. I want to call the processing step once every 100 rows, to implement batch sharding.

The problem with implementing the related answer is that the csv reader object is not subscriptable and does not support len().

>>> import csv
>>> reader = csv.reader(open('dataimport/tests/financial_sample.csv', 'rb'))
>>> len(reader)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: object of type '_csv.reader' has no len()
>>> reader[10:]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: '_csv.reader' object is unsubscriptable
>>> reader[10]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: '_csv.reader' object is unsubscriptable

How can I solve this?

Recommended answer

Just make your reader subscriptable by wrapping it in a list. Obviously this will break on really large files, because every row is loaded into memory at once (see the alternatives in the updates below):

>>> reader = csv.reader(open('big.csv', 'rb'))
>>> lines = list(reader)
>>> print lines[:100]
...

Further reading: How do you split a list into evenly sized chunks in Python?
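
Once the rows are in a list, fixed-size batches can be taken by slicing. The following is a minimal sketch in Python 3 syntax, assuming the whole file fits in memory. The helper name list_chunks and the generated sample rows are hypothetical and are not part of the original answer.

```python
def list_chunks(rows, chunksize=100):
    """Return consecutive slices of `rows`, each at most `chunksize` long."""
    return [rows[i:i + chunksize] for i in range(0, len(rows), chunksize)]


# Hypothetical stand-in for list(csv.reader(...)): 250 rows of two columns.
rows = [[str(n), str(n * n)] for n in range(250)]

batches = list_chunks(rows, chunksize=100)
print([len(batch) for batch in batches])
```

The last batch is simply shorter when the row count is not a multiple of the chunk size.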

Update 1 (list version): Another possible way would be to process each chunk as it arrives while iterating over the lines:

#!/usr/bin/env python

import csv
reader = csv.reader(open('4956984.csv', 'rb'))

chunk, chunksize = [], 100

def process_chunk(chunk):
    print len(chunk)
    # do something useful ...

for i, line in enumerate(reader):
    if (i % chunksize == 0 and i > 0):
        process_chunk(chunk)
        del chunk[:]
    chunk.append(line)

# process the remainder
process_chunk(chunk)

Update 2 (generator version): I haven't benchmarked it, but maybe you can increase performance by using a chunk generator:

#!/usr/bin/env python

import csv
reader = csv.reader(open('4956984.csv', 'rb'))

def gen_chunks(reader, chunksize=100):
    """ 
    Chunk generator. Take a CSV `reader` and yield
    `chunksize` sized slices. 
    """
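    chunk = []
    for i, line in enumerate(reader):
        if (i % chunksize == 0 and i > 0):
            yield chunk
            # start a fresh list so chunks already yielded are not emptied
            chunk = []
        chunk.append(line)
    if chunk:
        # yield the remainder, unless the input was empty
        yield chunk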

for chunk in gen_chunks(reader):
    print chunk # process chunk

# test gen_chunks on some dummy sequence:
for chunk in gen_chunks(range(10), chunksize=3):
    print chunk # process chunk

# => yields
# [0, 1, 2]
# [3, 4, 5]
# [6, 7, 8]
# [9]
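
For readers on Python 3, the same chunking idea can be sketched with itertools.islice, which pulls at most chunksize rows from the reader per step without loading the whole file. This is a minimal sketch and not part of the original answer. The name iter_chunks and the in-memory sample data are hypothetical.

```python
import csv
import io
from itertools import islice


def iter_chunks(rows, chunksize=100):
    """Yield lists of up to `chunksize` items from any iterable."""
    iterator = iter(rows)
    while True:
        # islice consumes at most `chunksize` items from the shared iterator
        chunk = list(islice(iterator, chunksize))
        if not chunk:
            return  # input exhausted; no empty chunk is yielded
        yield chunk


# Hypothetical in-memory CSV so the example runs without a file on disk.
sample = io.StringIO("a,b\n1,2\n3,4\n5,6\n7,8\n")
reader = csv.reader(sample)

chunks = list(iter_chunks(reader, chunksize=2))
for chunk in chunks:
    print(len(chunk), chunk)
```

Each yielded chunk is a new list, so it is safe to keep references to earlier chunks. When reading a real file in Python 3, the csv module documentation recommends opening it with open(path, newline='') instead of the 'rb' mode used in the Python 2 examples above.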
