How to create a pivot table on extremely large dataframes in Pandas

Problem description

I need to create a pivot table of 2000 columns by around 30-50 million rows from a dataset of around 60 million rows. I've tried pivoting in chunks of 100,000 rows, and that works, but when I try to recombine the DataFrames by doing a .append() followed by .groupby('someKey').sum(), all my memory is taken up and python eventually crashes.

How can I do a pivot on data this large with a limited amount of RAM?

Adding example code

The following code includes various test outputs along the way, but the last print is what we're really interested in. Note that if we change segMax to 3, instead of 4, the code will produce a false positive for correct output. The main issue is that if a shipmentid entry is not in each and every chunk that sum(wawa) looks at, it doesn't show up in the output.

import pandas as pd
import numpy as np
import random
import os

pd.set_option('io.hdf.default_format','table') 

# create a small dataframe to simulate the real data.
def loadFrame():
    frame = pd.DataFrame()
    frame['shipmentid']=[1,2,3,1,2,3,1,2,3] #evenly distributing shipmentid values for testing purposes
    frame['qty']= np.random.randint(1,5,9) #random quantity is ok for this test
    frame['catid'] = np.random.randint(1,5,9) #random category is ok for this test
    return frame

def pivotSegment(segmentNumber,passedFrame):
    segmentSize = 3 #take 3 rows at a time
    frame = passedFrame[(segmentNumber*segmentSize):(segmentNumber*segmentSize + segmentSize)] #slice the input DF

    # ensure that all chunks are identically formatted after the pivot by appending a dummy DF with all possible category values
    span = pd.DataFrame() 
    span['catid'] = range(1,5+1)
    span['shipmentid']=1
    span['qty']=0

    frame = frame.append(span)  # note: DataFrame.append was removed in pandas 2.0; use pd.concat([frame, span]) there

    return frame.pivot_table(['qty'],index=['shipmentid'],columns='catid', \
                             aggfunc='sum',fill_value=0).reset_index()

def createStore():

    store = pd.HDFStore('testdata.h5')
    return store

segMin = 0
segMax = 4

store = createStore()
frame = loadFrame()

print('Printing Frame')
print(frame)
print(frame.info())

for i in range(segMin,segMax):
    segment = pivotSegment(i,frame)
    store.append('data',frame[(i*3):(i*3 + 3)])
    store.append('pivotedData',segment)

print('\nPrinting Store')   
print(store)
print('\nPrinting Store: data') 
print(store['data'])
print('\nPrinting Store: pivotedData') 
print(store['pivotedData'])

print('**************')
print(store['pivotedData'].set_index('shipmentid').groupby(level=0).sum())
print('**************')
print('$$$')
for df in store.select('pivotedData',chunksize=3):
    print(df.set_index('shipmentid').groupby(level=0).sum())

print('$$$')
store['pivotedAndSummed'] = sum((df.set_index('shipmentid').groupby(level=0).sum() for df in store.select('pivotedData',chunksize=3)))
print('\nPrinting Store: pivotedAndSummed') 
print(store['pivotedAndSummed'])

store.close()
os.remove('testdata.h5')
print('closed')

Recommended answer

You could do the appending with HDF5/pytables. This keeps it out of RAM.

Using table format:

store = pd.HDFStore('store.h5')
for ...:
    ...
    chunk  # the chunk of the DataFrame (which you want to append)
    store.append('df', chunk)
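
For example, a minimal sketch of that loop (the CSV source, filename, and chunk size here are illustrative assumptions, not from the original question):

import pandas as pd

store = pd.HDFStore('store.h5')

# stream the large source file in manageable pieces and append each to the store
for chunk in pd.read_csv('source_data.csv', chunksize=100000):
    # declaring someKey as a data column allows querying on it later
    store.append('df', chunk, data_columns=['someKey'])

store.close()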

Now you can read it in as a DataFrame in one go (assuming this DataFrame can fit in memory!):

df = store['df']

You can also query, to get only subsections of the DataFrame.
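
For instance (a hypothetical query; it assumes 'someKey' was declared as a data column when appending, as in the sketch above):

# pull back only the rows matching a condition, without loading the whole table
subset = store.select('df', where='someKey == 42')

# or read just a slice of rows by position
part = store.select('df', start=0, stop=1000000)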

Aside: You should also buy more RAM, it's cheap.

Edit: you can groupby/sum from the store iteratively, since this "map-reduces" over the chunks:

# note: this doesn't work, see below
sum(df.groupby('someKey').sum() for df in store.select('df', chunksize=50000))
# equivalent to (but doesn't read in the entire frame)
store['df'].groupby('someKey').sum()

Edit2: Using sum as above doesn't actually work in pandas 0.16 (I thought it did in 0.15.2); instead you can use reduce with add:

from functools import reduce  # a builtin in Python 2; must be imported in Python 3

reduce(lambda x, y: x.add(y, fill_value=0),
       (df.groupby('someKey').sum() for df in store.select('df', chunksize=50000)))

In Python 3 you have to import reduce from functools.

Perhaps write it as:

chunks = (df.groupby('someKey').sum() for df in store.select('df', chunksize=50000))
res = next(chunks)  # will raise if there are no chunks!
for c in chunks:
    res = res.add(c, fill_value=0)

If performance is poor, or if there are a large number of new groups, it may be preferable to start res as zeros of the correct size (by getting the unique group keys, e.g. by looping through the chunks), and then add in place.
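
A minimal sketch of that preallocation idea, making two passes over the store (the key and column names are illustrative, following the question's 'someKey'):

import pandas as pd

# first pass: collect every group key that appears in any chunk
keys = set()
for df in store.select('df', chunksize=50000):
    keys.update(df['someKey'].unique())

# preallocate the result as zeros of the correct size
value_cols = [c for c in store.select('df', start=0, stop=1).columns if c != 'someKey']
res = pd.DataFrame(0, index=sorted(keys), columns=value_cols)

# second pass: add each chunk's partial sums in place
for df in store.select('df', chunksize=50000):
    partial = df.groupby('someKey').sum()
    res.loc[partial.index] += partial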
