Pandas append performance: concat/append using "larger" DataFrames


Problem description


The problem: I have data stored in csv files with the columns date/id/value. I have 15 files, each containing around 10-20 million rows. Each csv file covers a distinct period, so the time indexes are non-overlapping, but the columns do overlap (new ids enter from time to time, old ones disappear). What I originally did was run the script without the pivot call, but then I ran into memory issues on my local machine (only 8GB). Since there is a lot of redundancy in each file, pivot seemed at first a nice way out (roughly 2/3 less data), but now performance kicks in. If I run the following script, the concat function runs "forever" (so far I have always interrupted it manually after some time, >2h). Does concat/append have limitations in terms of size (I have roughly 10000-20000 columns), or am I missing something here? Any suggestions?

import pandas as pd
path = 'D:\\'
data = pd.DataFrame()
# loop through list of raw file names
for file in raw_files:
    data_tmp = pd.read_csv(path + file, engine='c',
                           compression='gzip',
                           low_memory=False,
                           usecols=['date', 'Value', 'ID'])
    data_tmp = data_tmp.pivot(index='date', columns='ID',
                              values='Value')

    data = pd.concat([data, data_tmp])
    del data_tmp


EDIT I: To clarify, each csv file has about 10-20 million rows and three columns; after the pivot is applied, this reduces to about 2000 rows but leads to 10000 columns.


I can solve the memory issue by simply splitting the full set of ids into subsets and running the needed calculations on each subset, as they are independent for each id. I know this makes me reload the same files n times, where n is the number of subsets used, but this is still reasonably fast. I still wonder why append/concat performs so poorly.
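
For illustration, a minimal sketch of that subset workaround, assuming a collection all_ids holding the full set of ids and a placeholder run_calculations function for the per-id computations (neither appears in the original post); path and raw_files are the same as in the script above:

import pandas as pd

path = 'D:\\'
n_subsets = 10                                    # hypothetical number of id subsets
id_chunks = [list(all_ids)[i::n_subsets]          # all_ids: full set of ids (assumed)
             for i in range(n_subsets)]

for ids in id_chunks:
    frames = []
    for file in raw_files:                        # raw_files: same list as in the question
        tmp = pd.read_csv(path + file, engine='c',
                          compression='gzip',
                          low_memory=False,
                          usecols=['date', 'Value', 'ID'])
        tmp = tmp[tmp['ID'].isin(ids)]            # keep only the ids of this subset
        frames.append(tmp.pivot(index='date', columns='ID', values='Value'))
    subset_data = pd.concat(frames)               # far fewer columns per subset
    run_calculations(subset_data)                 # placeholder for the per-id computations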


EDIT II: I have tried to recreate the file structure with a simulation that is as close as possible to the actual data structure. I hope it is clear; I didn't spend too much time minimizing simulation time, but it runs reasonably fast on my machine.

import string
import random
import pandas as pd
import numpy as np
import math

# Settings :-------------------------------
num_ids = 20000
start_ids = 4000
num_files = 10
id_interval = int((num_ids-start_ids)/num_files)
len_ids = 9
start_date = '1960-01-01'
end_date = '2014-12-31'
run_to_file = 2
# ------------------------------------------

# Simulation column IDs
id_list = []
# ensure unique elements are of size >num_ids
for x in range(num_ids + round(num_ids*0.1)):
    id_list.append(''.join(
        random.choice(string.ascii_uppercase + string.digits) for _
        in range(len_ids)))
id_list = set(id_list)
id_list = list(id_list)[:num_ids]

time_index = pd.bdate_range(start_date, end_date, freq='D')
chunk_size = math.ceil(len(time_index)/num_files)

data = []
#  Simulate files
for file in range(0, run_to_file):
    tmp_time = time_index[file * chunk_size:(file + 1) * chunk_size]
    # TODO: not all cases covered; make sure ints are obtained
    tmp_ids = id_list[file * id_interval:
        start_ids + (file + 1) * id_interval]

    tmp_data = pd.DataFrame(np.random.standard_normal(
        (len(tmp_time), len(tmp_ids))), index=tmp_time,
        columns=tmp_ids)

    # sort_index(level=1) replaces the removed sortlevel(1)
    tmp_file = tmp_data.stack().sort_index(level=1).reset_index()
    # final simulated data structure of the parsed csv file
    tmp_file = tmp_file.rename(columns={'level_0': 'Date', 'level_1':
                                        'ID', 0: 'Value'})

    # comment/uncomment if pivot takes place on aggregate level or not
    tmp_file = tmp_file.pivot(index='Date', columns='ID',
                              values='Value')
    data.append(tmp_file)

data = pd.concat(data)
# comment/uncomment if pivot takes place on aggregate level or not
# data = data.pivot(index='Date', columns='ID', values='Value')

Recommended answer


Using your reproducible example code, I can indeed confirm that the concat of only two dataframes takes a very long time. However, if you first align them (make the column names equal), then concatenating is very fast:

In [94]: df1, df2 = data[0], data[1]

In [95]: %timeit pd.concat([df1, df2])
1 loops, best of 3: 18min 8s per loop

In [99]: %%timeit
   ....: df1b, df2b = df1.align(df2, axis=1)
   ....: pd.concat([df1b, df2b])
   ....:
1 loops, best of 3: 686 ms per loop


The result of both approaches is the same.
The aligning is equivalent to:

common_columns = df1.columns.union(df2.columns)
df1b = df1.reindex(columns=common_columns)
df2b = df2.reindex(columns=common_columns)


So this is probably the easier approach when you have to deal with a full list of dataframes.
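
For instance, a rough sketch of that approach, where frames stands for the list of pivoted DataFrames (one per csv file, as built in the loops above) and all values are assumed to share a single dtype:

import pandas as pd

# frames: list of pivoted DataFrames, one per csv file
common_columns = frames[0].columns
for df in frames[1:]:
    common_columns = common_columns.union(df.columns)

# reindex every frame to the union of columns, then concat the aligned frames once
aligned = [df.reindex(columns=common_columns) for df in frames]
result = pd.concat(aligned)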


The reason that pd.concat is slower is that it does more. E.g. when the column names are not equal, it checks for every column whether the dtype has to be upcast to hold the NaN values that get introduced by aligning the column names. By aligning the frames yourself, you skip this step. But in this case, where you are sure all columns have the same dtype, this is no problem.
That it is so much slower surprises me as well, but I will raise an issue about that.
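
As a small, self-contained illustration of that upcasting (not part of the original answer): concatenating frames with non-matching integer columns introduces NaNs, which forces the columns to float.

import pandas as pd

df1 = pd.DataFrame({'a': [1, 2]})   # int64 column
df2 = pd.DataFrame({'b': [3, 4]})   # int64 column

# the missing cells are filled with NaN, so both columns end up as float64
print(pd.concat([df1, df2]).dtypes)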

