Splitting a large Pandas DataFrame with minimal memory footprint


Problem Description


I have a large DataFrame, which I would like to split into a test set and a train set for model building. However, I do not want to duplicate the DataFrame because I am reaching a memory limit.

Is there an operation, similar to pop but for a large segment, that will simultaneously remove a portion of the DataFrame and allow me to assign it to a new DataFrame? Something like this:

# Assume I have initialized a DataFrame (called "all") which contains my large dataset,
# with a boolean column called "test" which indicates whether a record should be used for
# testing.
print(len(all))
# 10000000
test = all.pop_large_segment(all['test'])  # not a real command, just a placeholder
print(len(all))
# 8000000
print(len(test))
# 2000000

Solution

If you have the space to add one more column, you could add one with a random value that you could then filter on for your testing. Here I used random 0/1 values drawn with equal probability, but you could use any random draw (and cutoff) if you wanted a different proportion.

import numpy as np
import pandas as pd

df = pd.DataFrame({'one': [1,2,3,4,5,4,3,2,1], 'two': [6,7,8,9,10,9,8,7,6], 'three': [11,12,13,14,15,14,13,12,11]})
df['split'] = np.random.randint(0, 2, size=len(df))  # 0/1 label to filter on later

Of course this requires that you have space to add an entirely new column, and if your data is very long you may not.

Another option works if, for example, your data is in CSV format and you know the number of rows. Build a random selection of row indices in a similar way, but pass that list into the skiprows argument of pandas' read_csv():

import numpy as np
import pandas as pd

num_rows = 100000           # total number of data rows in the CSV
all_rows = range(num_rows)

# Randomly pick half of the row indices to skip when loading the training file.
some = np.random.choice(all_rows, replace=False, size=num_rows // 2)
some.sort()
trainer_df = pd.read_csv(path, skiprows=some)   # "path" points at your CSV file

# The complement of "some" selects the other half on the second read.
rest = [i for i in all_rows if i not in some]
rest.sort()
df = pd.read_csv(path, skiprows=rest)

It's a little clunky up front, especially with the loop in the list comprehension, and creating those lists in memory is unfortunate, but it should still be better memory-wise than just creating an entire copy of half the data.
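If that list-comprehension step bothers you, one possible variation (my own, not from the original answer) is to let NumPy compute the complement directly; np.setdiff1d returns a sorted array, so the separate sort goes away:

rest = np.setdiff1d(np.arange(num_rows), some)   # row indices not chosen in "some"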

To make it even more memory friendly you could load the trainer subset, train the model, then overwrite the training dataframe with the rest of the data, then apply the model. You'll be stuck carrying some and rest around, but you'll never have to load both halves of the data at the same time.
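A rough sketch of that load-train-overwrite-apply flow, reusing path, some, and rest from above; the estimator and the 'target' label column are placeholder assumptions, not part of the original answer:

from sklearn.linear_model import LogisticRegression  # placeholder estimator

model = LogisticRegression()

trainer_df = pd.read_csv(path, skiprows=some)                         # training half only
model.fit(trainer_df.drop(columns=['target']), trainer_df['target'])  # 'target' is hypothetical

trainer_df = pd.read_csv(path, skiprows=rest)                         # overwrite with the other half
predictions = model.predict(trainer_df.drop(columns=['target']))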
