How do you split reading a large csv file into evenly-sized chunks in Python?

Problem Description

In a basic script, I have the following process:

import csv
reader = csv.reader(open('huge_file.csv', 'rb'))

for line in reader:
    process_line(line)

See this related question. I want to hand the lines off for processing in batches of 100 rows, to implement batch sharding.

The problem with implementing the related answer is that the csv reader object is not subscriptable and does not support len():

>>> import csv
>>> reader = csv.reader(open('dataimport/tests/financial_sample.csv', 'rb'))
>>> len(reader)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: object of type '_csv.reader' has no len()
>>> reader[10:]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: '_csv.reader' object is unsubscriptable
>>> reader[10]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: '_csv.reader' object is unsubscriptable

How can I solve this?

Solution

Just make your reader subscriptable by wrapping it in a list. This loads the entire file into memory, so it will obviously break on really large files (see the alternatives in the updates below):

>>> reader = csv.reader(open('big.csv', 'rb'))
>>> lines = list(reader)
>>> print lines[:100]
...

Further reading: How do you split a list into evenly sized chunks in Python?
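
For reference, here is what the linked list-chunking idea looks like applied to the materialized lines. This is a minimal sketch of mine rather than part of the original answer, written Python 3 style (the csv docs recommend opening the file in text mode with newline=''); it is only viable when the whole file fits in memory:

#!/usr/bin/env python
import csv

# materialize the whole file in memory first
with open('big.csv', newline='') as f:
    lines = list(csv.reader(f))

# slice the list into chunks of 100 rows
for start in range(0, len(lines), 100):
    chunk = lines[start:start + 100]
    # process chunk ...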


Update 1 (list version): Another possible way would be to just process each chunk as it arrives while iterating over the lines:

#!/usr/bin/env python

import csv
reader = csv.reader(open('4956984.csv', 'rb'))

chunk, chunksize = [], 100

def process_chunk(chunk):
    print len(chunk)
    # do something useful ...

for i, line in enumerate(reader):
    if (i % chunksize == 0 and i > 0):
        # every `chunksize` rows, hand the accumulated chunk off for processing
        process_chunk(chunk)
        del chunk[:]  # or: chunk = []
    chunk.append(line)

# process the remainder
process_chunk(chunk)
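
A note on the del chunk[:]  # or: chunk = [] line: the two alternatives only differ if something still holds a reference to the old list. del chunk[:] empties that same list object in place, so every reference sees the cleared list, while chunk = [] rebinds the name and leaves any previously handed-out list untouched. A tiny illustration with hypothetical names:

saved = []
chunk = [1, 2]
saved.append(chunk)   # keep a reference to the current chunk

del chunk[:]          # clears in place: saved[0] is now [] as well
# chunk = []          # rebinding instead would leave saved[0] == [1, 2]

In the synchronous loop above this makes no difference, since process_chunk returns before the chunk is cleared, but it is exactly the gotcha noted for the generator version below.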


Update 2 (generator version): I haven't benchmarked it, but maybe you can increase performance by using a chunk generator:

#!/usr/bin/env python

import csv
reader = csv.reader(open('4956984.csv', 'rb'))

def gen_chunks(reader, chunksize=100):
    """ 
    Chunk generator. Take a CSV `reader` and yield
    `chunksize` sized slices. 
    """
    chunk = []
    for i, line in enumerate(reader):
        if (i % chunksize == 0 and i > 0):
            yield chunk
            del chunk[:]  # or: chunk = []
        chunk.append(line)
    yield chunk

for chunk in gen_chunks(reader):
    print chunk # process chunk

# test gen_chunk on some dummy sequence:
for chunk in gen_chunks(range(10), chunksize=3):
    print chunk # process chunk

# => yields
# [0, 1, 2]
# [3, 4, 5]
# [6, 7, 8]
# [9]

There is a minor gotcha, as @totalhack points out:

Be aware that this yields the same object over and over with different contents. This works fine if you plan on doing everything you need to with the chunk between each iteration.
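
If you do need chunks that outlive an iteration (for example, collecting them in a list or handing them to worker threads), one way around the gotcha is to build a fresh list per chunk. Here is a sketch of mine using itertools.islice, not part of the original answer:

from itertools import islice

def gen_chunks(reader, chunksize=100):
    """Yield successive `chunksize`-row lists; every yield is a new list."""
    it = iter(reader)                        # works for csv readers and any iterable
    while True:
        chunk = list(islice(it, chunksize))  # take up to `chunksize` rows
        if not chunk:
            return
        yield chunk

Because each chunk is a newly created list, callers can safely keep references to earlier chunks across iterations.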
