Read multiple CSV files in Pandas in chunks


Problem Description

How do I import and read multiple CSV files in chunks when the total size of all the CSVs is around 20 GB?

I don't want to use Spark, because I want to use a model in SkLearn, so I want the solution in Pandas itself.

My code is:

import glob
import os
import pandas as pd

allFiles = glob.glob(os.path.join(path, "*.csv"))
df = pd.concat((pd.read_csv(f, sep=",") for f in allFiles))
df.reset_index(drop=True, inplace=True)

But this fails, as the total size of all the CSVs in my path is 17 GB.

I want to read them in chunks, but I get an error when I try this:

allFiles = glob.glob(os.path.join(path, "*.csv"))
df = pd.concat((pd.read_csv(f, sep=",", chunksize=10000) for f in allFiles))
df.reset_index(drop=True, inplace=True)

The error I get is:

"cannot concatenate object of type "&lt;class 'pandas.io.parsers.TextFileReader'&gt;"; only pd.Series, pd.DataFrame, and pd.Panel (deprecated) objs are valid"

Can someone help?

Recommended Answer

The error happens because pd.read_csv with chunksize set does not return a DataFrame but a TextFileReader, an iterator over DataFrame chunks, which pd.concat cannot concatenate. One way to do this is to read each file in chunks with pd.read_csv(file, chunksize=chunksize); whenever the last chunk read from a file comes up shorter than the chunksize, save that extra bit and prepend it to the first chunk of the next file.

To keep the chunks a uniform size, read a correspondingly smaller first chunk from the next file, so that together with the leftover it adds up to the full chunk size.
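
For reference, the basic pattern for consuming those readers is sketched below (a minimal sketch reusing path and the imports from the question; process is a hypothetical placeholder for whatever per-chunk work is needed). Its drawback is that chunks at file boundaries come up shorter than the requested size, which is exactly what the generator that follows smooths out:

allFiles = glob.glob(os.path.join(path, "*.csv"))
for f in allFiles:
    # read_csv with chunksize yields DataFrames of up to 10000 rows each.
    for chunk in pd.read_csv(f, sep=",", chunksize=10000):
        process(chunk)  # hypothetical per-chunk work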

import os

import pandas as pd

def chunk_from_files(dir, master_chunksize):
    '''
    Provided a directory, loops through its csv files and yields dataframes
    of a uniform size, carrying leftover rows across file boundaries.
    :param dir: Directory containing the csv files (csv files only).
    :param master_chunksize: Number of rows per yielded chunk.
    :return: Generator of dataframes with master_chunksize rows each;
             only the very last chunk may be shorter.
    '''
    files = os.listdir(dir)

    extra_chunk = None  # Leftover rows from the end of the previous file.
    for file in files:
        csv_file = os.path.join(dir, file)
        reader = pd.read_csv(csv_file, chunksize=master_chunksize)

        # If there is a leftover, read just enough rows from this file to
        # top it up to master_chunksize, then yield the combined chunk.
        if extra_chunk is not None:
            first = reader.get_chunk(master_chunksize - extra_chunk.shape[0])
            extra_chunk = pd.concat([extra_chunk, first])
            if extra_chunk.shape[0] == master_chunksize:
                yield extra_chunk
                extra_chunk = None
            # Otherwise this file was too small to fill the chunk;
            # keep carrying the leftover into the next file.

        for chunk in reader:
            if chunk.shape[0] < master_chunksize:
                # Short final chunk of this file: stash it for the next file.
                extra_chunk = chunk
            else:
                yield chunk

    # Yield whatever remains after the final file.
    if extra_chunk is not None:
        yield extra_chunk
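
Since the point of staying in Pandas is to feed the data to scikit-learn, a model that supports incremental learning can consume this generator directly. A minimal sketch, assuming a target column named "label" with classes 0 and 1 and using SGDClassifier — all of these are illustrative choices, not part of the original question:

from sklearn.linear_model import SGDClassifier

model = SGDClassifier()
for chunk in chunk_from_files(path, 10000):
    X = chunk.drop(columns=["label"])  # "label" is a hypothetical target column
    y = chunk["label"]
    # partial_fit needs the full list of classes on the first call.
    model.partial_fit(X, y, classes=[0, 1])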

