Merging 1300 data frames into a single frame becomes really slow


Problem description


I have 1300 csv files in a directory.

Each file has a date in the first column, followed by daily data for the last 20-30 years which spans another 8 columns.

So like this, Data1.csv

Date source1 source2 source3 source4 source5 source6 source7 source8

I have 1300 uniquely named files.

I am trying to merge all of these into one dataframe using pandas like this

import os
import pandas as pd

frame = pd.DataFrame()

length = len(os.listdir(filepath))   # countdown of files left to process
for filename in os.listdir(filepath):
    file_path = os.path.join(filepath, filename)
    print(length, end=" ")
    df = pd.read_csv(file_path, index_col=0)
    # Reshape the 8 source columns into long format: one 'Data' column plus a
    # 'Source' label of the form '<filename>-<column>'
    df = pd.concat([df[[col]].assign(Source=f'{filename[:-4]}-{col}').rename(columns={col: 'Data'}) for col in df])
    frame = frame.append(df)         # accumulate everything into one big frame
    length -= 1

But around the 300th file I have around 12 million rows and my code really slows down...

Is there a way to speed this up before my computer runs out of memory?

My goal is actually to have a massive dataframe which is 1 + (1300 x 8) columns by the number of dates in 30 years.

Solution

The reason your loop slows down is that at each .append(), the dataframe has to create a copy of itself in order to allocate more memory, as described here.
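To make this concrete, here is a minimal timing sketch (my own illustration with dummy data, not part of the original answer): growing one frame inside the loop re-copies everything accumulated so far on every pass, while a single pd.concat at the end copies the data once.

import time
import numpy as np
import pandas as pd

# 300 dummy frames standing in for the per-file results (hypothetical sizes)
chunks = [pd.DataFrame(np.random.rand(10_000, 2), columns=['Data', 'Source'])
          for _ in range(300)]

start = time.perf_counter()
frame = chunks[0]
for chunk in chunks[1:]:
    frame = pd.concat([frame, chunk])   # re-copies everything accumulated so far
print('grow inside the loop:', time.perf_counter() - start)

start = time.perf_counter()
frame = pd.concat(chunks)               # one allocation and copy at the end
print('single concat:', time.perf_counter() - start)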

If your memory can fit it all, you could first fill a list of fixed size (1300) with all the data frames, and then call df = pd.concat(list_of_dataframes), which would probably avoid the issue you are having right now. Your code could be adjusted like this:

import os
import pandas as pd

lst = [None for _ in range(1300)]   # pre-allocated list, one slot per file

for i, filename in enumerate(os.listdir(filepath)):
    file_path = os.path.join(filepath, filename)
    df = pd.read_csv(file_path, index_col=0)
    # Same reshaping as before: one 'Data' column plus a 'Source' label per original column
    df = pd.concat([df[[col]].assign(Source=f'{filename[:-4]}-{col}').rename(columns={col: 'Data'}) for col in df])
    lst[i] = df

frame = pd.concat(lst)   # single concatenation instead of growing inside the loop
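As a small follow-up (my own variation, not part of the original answer), you might also filter the directory listing to .csv files with glob, so the list size follows the actual file count and any non-CSV files in the directory are skipped; filepath is assumed to be defined as in the question.

import glob
import os
import pandas as pd

csv_paths = sorted(glob.glob(os.path.join(filepath, '*.csv')))   # only the CSV files
lst = []
for file_path in csv_paths:
    name = os.path.splitext(os.path.basename(file_path))[0]      # filename without extension
    df = pd.read_csv(file_path, index_col=0)
    df = pd.concat([df[[col]].assign(Source=f'{name}-{col}').rename(columns={col: 'Data'}) for col in df])
    lst.append(df)   # appending to a Python list is cheap, unlike DataFrame.append

frame = pd.concat(lst)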

