Split really large file into smaller files in Python - Too many open files


Question

I have a really large csv file (close to a Terabyte) that I want to split into smaller csv files, based on info in each row.

Since there is no way to do that in memory, my intended approach was to read each line, decide which file it should go into, and append it there. This however takes ages, since opening and closing takes too long.
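The slow per-row approach described above can be sketched as follows; the `key_for_row` callback and the input filename are illustrative placeholders, not part of the original question:

```python
import csv

def split_per_row(src, key_for_row):
    """Append each row of src to the small CSV chosen by key_for_row.

    This is the approach the question describes: opening and closing the
    target file for every single row is what makes it far too slow on a
    ~1 TB input.
    """
    with open(src, newline='') as f:
        for row in csv.reader(f):
            out_name = 'small_{}.csv'.format(key_for_row(row))
            # Open, append one row, close - repeated millions of times
            with open(out_name, 'a', newline='') as out:
                csv.writer(out).writerow(row)
```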

My second approach was to keep all files (about 3000) open - this however does not work since I can't have so many files open in parallel.
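As an aside, on Unix-like systems the per-process open-file limit that breaks this approach can be inspected, and often raised up to the hard limit without root, via the standard-library `resource` module (this is platform-specific and not available on Windows):

```python
import resource

# Unix-only: inspect the per-process limit on open file descriptors
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print('open-file limits: soft={}, hard={}'.format(soft, hard))

# A non-root process may raise its soft limit up to the hard limit,
# which can make keeping ~3000 files open at once feasible
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
```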

Additional details as requested: The .csv file contains map data I need to access region-wise. Therefore, I plan on clustering it into files covering different bounding boxes. Since it is unsorted data, I have to process the lat/lon of each row, assign the correct file to it, and append the row to the file.
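Assigning a row to a bounding-box file from its lat/lon could look like this minimal sketch; the 1-degree cell size and the `cell_key` naming scheme are illustrative assumptions:

```python
import math

def cell_key(lat, lon, cell_deg=1.0):
    """Map a coordinate to the name of the bounding-box file it belongs to.

    Floor division puts each point into a cell_deg x cell_deg grid cell,
    so all rows in the same cell end up in the same small file.
    """
    return 'cell_{}_{}.csv'.format(math.floor(lat / cell_deg),
                                   math.floor(lon / cell_deg))
```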

What would be a working (fast, ideally) approach for that?

Answer

This may be somewhat of a hacky method, but it only requires pandas and doing some batched appends. It solves the issue of having to open and close files while processing every row. I'm going to assume that the way you triage rows into your smaller CSVs is based on some value in a column of your large CSV.

import os
import pandas as pd

# Read the large CSV in chunks so it never has to fit in memory
df_chunked = pd.read_csv("myLarge.csv", chunksize=30000)  # you can alter the chunksize

for chunk in df_chunked:
    for val in chunk['col'].unique():
        df_to_write = chunk[chunk['col'] == val]
        out_path = 'small_{}.csv'.format(val)
        if os.path.isfile(out_path):  # append without a header if the file already exists
            df_to_write.to_csv(out_path, mode='a', index=False, header=False)
        else:  # otherwise create it, writing the header once
            df_to_write.to_csv(out_path, index=False)
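For the map-data case specifically, the same batched-append idea can be combined with a grid-cell key derived from lat/lon; this sketch assumes columns named `lat` and `lon` and an illustrative cell size, so adjust both to your data:

```python
import os
import pandas as pd

def split_by_grid(src, chunksize=30000, cell_deg=1.0):
    """Split src into per-grid-cell CSVs based on 'lat'/'lon' columns.

    Each chunk is grouped by its cell_deg x cell_deg bounding-box cell,
    so each small file is opened once per chunk rather than once per row.
    """
    for chunk in pd.read_csv(src, chunksize=chunksize):
        # Derive a cell label like '52_13' for every row in the chunk
        cells = ((chunk['lat'] // cell_deg).astype(int).astype(str)
                 + '_' + (chunk['lon'] // cell_deg).astype(int).astype(str))
        for cell, group in chunk.groupby(cells):
            path = 'small_{}.csv'.format(cell)
            # Write the header only when the file is first created
            group.to_csv(path, mode='a', index=False,
                         header=not os.path.isfile(path))
```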
