How do you split reading a large csv file into evenly-sized chunks in Python?


Question

Basically, I have the following process.

import csv
reader = csv.reader(open('huge_file.csv', newline=''))

for line in reader:
    process_line(line)

See this related question. I want to process the lines in batches of 100 rows, to implement batch sharding.

The problem with implementing the related answer is that the csv reader object is not subscriptable and does not support len():

>>> import csv
>>> reader = csv.reader(open('dataimport/tests/financial_sample.csv', newline=''))
>>> len(reader)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: object of type '_csv.reader' has no len()
>>> reader[10:]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: '_csv.reader' object is not subscriptable
>>> reader[10]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: '_csv.reader' object is not subscriptable

How can I solve this?

Answer

Just make your reader subscriptable by wrapping it into a list. Obviously this will break on really large files (see the alternatives in the updates below):

>>> reader = csv.reader(open('big.csv', newline=''))
>>> lines = list(reader)
>>> print(lines[:100])
...

Further reading: How do you split a list into evenly sized chunks in Python?
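For example, once the rows are materialized in a list, plain slicing splits them into evenly sized chunks (a minimal sketch; `lines` and the chunk size of 100 are taken from the snippets above):

# split the list of rows into chunks of up to 100 rows each;
# the last chunk may be shorter
chunks = [lines[i:i + 100] for i in range(0, len(lines), 100)]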

Update 1 (list version): Another possible way would be to process each chunk as it arrives while iterating over the lines:

#!/usr/bin/env python

import csv

reader = csv.reader(open('4956984.csv', newline=''))

chunk, chunksize = [], 100

def process_chunk(chunk):
    print(len(chunk))
    # do something useful ...

for i, line in enumerate(reader):
    if i % chunksize == 0 and i > 0:
        process_chunk(chunk)
        del chunk[:]  # or: chunk = []
    chunk.append(line)

# process the remainder
process_chunk(chunk)


Update 2 (generator version): I haven't benchmarked it, but maybe you can increase performance by using a chunk generator:

#!/usr/bin/env python

import csv

reader = csv.reader(open('4956984.csv', newline=''))

def gen_chunks(reader, chunksize=100):
    """
    Chunk generator. Take a CSV `reader` and yield
    `chunksize` sized slices.
    """
    chunk = []
    for i, line in enumerate(reader):
        if i % chunksize == 0 and i > 0:
            yield chunk
            del chunk[:]  # or: chunk = []
        chunk.append(line)
    yield chunk

for chunk in gen_chunks(reader):
    print(chunk)  # process chunk

# test gen_chunks on some dummy sequence:
for chunk in gen_chunks(range(10), chunksize=3):
    print(chunk)  # process chunk

# => yields
# [0, 1, 2]
# [3, 4, 5]
# [6, 7, 8]
# [9]

There is a small gotcha, as pointed out by @totalhack:

Be aware that this yields the same object over and over with different contents. This works fine if you plan on doing everything you need to with the chunk between each iteration.
