Python Pandas Merge Causing Memory Overflow
Question
I'm new to Pandas and am trying to merge a few subsets of data. I'm giving a specific case where this happens, but the question is general: how/why is it happening, and how can I work around it?
The data I'm loading is about 85 MB, but I regularly watch my python session's memory usage climb to close to 10 GB and then give a memory error.
I have no idea why this happens, but it's killing me, as I can't even get started looking at the data the way I want to.
Here's what I'm doing:
Import the main data
import requests, zipfile, StringIO
import numpy as np
import pandas as pd
STAR2013url="http://www3.cde.ca.gov/starresearchfiles/2013/p3/ca2013_all_csv_v3.zip"
STAR2013fileName = 'ca2013_all_csv_v3.txt'
r = requests.get(STAR2013url)
z = zipfile.ZipFile(StringIO.StringIO(r.content))
STAR2013=pd.read_csv(z.open(STAR2013fileName))
Import some cross-reference tables
STARentityList2013url = "http://www3.cde.ca.gov/starresearchfiles/2013/p3/ca2013entities_csv.zip"
STARentityList2013fileName = "ca2013entities_csv.txt"
r = requests.get(STARentityList2013url)
z = zipfile.ZipFile(StringIO.StringIO(r.content))
STARentityList2013=pd.read_csv(z.open(STARentityList2013fileName))
STARlookUpTestID2013url = "http://www3.cde.ca.gov/starresearchfiles/2013/p3/tests.zip"
STARlookUpTestID2013fileName = "Tests.txt"
r = requests.get(STARlookUpTestID2013url)
z = zipfile.ZipFile(StringIO.StringIO(r.content))
STARlookUpTestID2013=pd.read_csv(z.open(STARlookUpTestID2013fileName))
STARlookUpSubgroupID2013url = "http://www3.cde.ca.gov/starresearchfiles/2013/p3/subgroups.zip"
STARlookUpSubgroupID2013fileName = "Subgroups.txt"
r = requests.get(STARlookUpSubgroupID2013url)
z = zipfile.ZipFile(StringIO.StringIO(r.content))
STARlookUpSubgroupID2013=pd.read_csv(z.open(STARlookUpSubgroupID2013fileName))
Rename a column ID to allow merging
STARlookUpSubgroupID2013 = STARlookUpSubgroupID2013.rename(columns={'001':'Subgroup ID'})
STARlookUpSubgroupID2013
A successful merge
merged = pd.merge(STAR2013,STARlookUpSubgroupID2013, on='Subgroup ID')
Try a second merge. This is where the memory overflow happens:
merged=pd.merge(merged, STARentityList2013, on='School Code')
I did all of this in an IPython notebook, but I don't think that changes anything.
Answer
Although this is an old question, I recently came across the same problem.
In my case, duplicate keys were required in both dataframes, and I needed a way to tell ahead of computation whether a merge would fit into memory and, if not, to change the computation method.
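A toy sketch of why duplicate keys make a merge explode (the column name is borrowed from the question; the data is made up): when the join key is duplicated on both sides, pandas emits one output row per matching left/right pair, so the result grows multiplicatively per key.

```python
import pandas as pd

# Made-up frames whose join key is duplicated on both sides.
left = pd.DataFrame({'School Code': [1, 1, 2], 'x': ['a', 'b', 'c']})
right = pd.DataFrame({'School Code': [1, 2, 2], 'y': ['d', 'e', 'f']})

# Key 1 -> 2 left rows * 1 right row = 2 output rows;
# key 2 -> 1 left row * 2 right rows = 2 output rows.
merged = pd.merge(left, right, on='School Code')
print(len(merged))  # 4
```

With tens of megabytes of input, even modest duplication on the join key can multiply the row count into the millions, which would explain a jump from 85 MB of data to ~10 GB of memory.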
The method I came up with is as follows:
def merge_size(left_frame, right_frame, group_by, how='inner'):
    left_groups = left_frame.groupby(group_by).size()
    right_groups = right_frame.groupby(group_by).size()
    left_keys = set(left_groups.index)
    right_keys = set(right_groups.index)
    intersection = right_keys & left_keys
    left_diff = left_keys - intersection
    right_diff = right_keys - intersection

    # NaN != NaN, so this comparison counts rows with a null key
    left_nan = len(left_frame[left_frame[group_by] != left_frame[group_by]])
    right_nan = len(right_frame[right_frame[group_by] != right_frame[group_by]])
    left_nan = 1 if left_nan == 0 and right_nan != 0 else left_nan
    right_nan = 1 if right_nan == 0 and left_nan != 0 else right_nan

    sizes = [(left_groups[group_name] * right_groups[group_name]) for group_name in intersection]
    sizes += [left_nan * right_nan]

    left_size = [left_groups[group_name] for group_name in left_diff]
    right_size = [right_groups[group_name] for group_name in right_diff]
    if how == 'inner':
        return sum(sizes)
    elif how == 'left':
        return sum(sizes + left_size)
    elif how == 'right':
        return sum(sizes + right_size)
    return sum(sizes + left_size + right_size)
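As a sanity check of the idea behind merge_size, the predicted inner-merge row count can be computed directly from per-key group sizes and compared with an actual merge (the frames here are made up for illustration):

```python
import pandas as pd

left = pd.DataFrame({'key': ['a', 'a', 'b', 'c'], 'v1': range(4)})
right = pd.DataFrame({'key': ['a', 'b', 'b', 'd'], 'v2': range(4)})

# Inner-merge size = sum over shared keys of
# (left rows with that key) * (right rows with that key).
left_sizes = left.groupby('key').size()
right_sizes = right.groupby('key').size()
shared = left_sizes.index.intersection(right_sizes.index)
predicted = int((left_sizes[shared] * right_sizes[shared]).sum())

actual = len(pd.merge(left, right, on='key'))
print(predicted, actual)  # 4 4
```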
Note:
At present with this method, the key can only be a label, not a list. Using a list for group_by currently returns the sum of the merge sizes for each label in the list, which is far larger than the actual merge size.
If you are using a list of labels for group_by, the final row count is:
min([merge_size(df1, df2, label, how) for label in group_by])
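That min() works as an upper bound because merging on additional labels can only filter matching pairs out, never add them. A small sketch with made-up data:

```python
import pandas as pd

left = pd.DataFrame({'k1': ['a', 'a', 'b'], 'k2': ['x', 'y', 'y'], 'v': range(3)})
right = pd.DataFrame({'k1': ['a', 'b', 'b'], 'k2': ['x', 'y', 'z'], 'w': range(3)})

def inner_size(lf, rf, label):
    # Per-label inner-merge size from group-size products, as in merge_size.
    ls, rs = lf.groupby(label).size(), rf.groupby(label).size()
    shared = ls.index.intersection(rs.index)
    return int((ls[shared] * rs[shared]).sum())

bound = min(inner_size(left, right, label) for label in ['k1', 'k2'])
actual = len(pd.merge(left, right, on=['k1', 'k2']))
print(bound, actual)  # 3 2 -- the true size never exceeds the bound
```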
Check if this fits in memory
The merge_size function defined here returns the number of rows that will be created by merging two dataframes together. By multiplying this by the number of columns from both dataframes, then by the size of np.float[32/64], you can get a rough idea of how large the resulting dataframe will be in memory. That can then be compared against psutil.virtual_memory().available to see whether your system can compute the full merge.
import numpy as np
import psutil

def mem_fit(df1, df2, key, how='inner'):
    rows = merge_size(df1, df2, key, how)
    cols = len(df1.columns) + (len(df2.columns) - 1)
    required_memory = (rows * cols) * np.dtype(np.float64).itemsize
    return required_memory <= psutil.virtual_memory().available
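The arithmetic in mem_fit can be checked by hand; the row and column counts below are made-up placeholders:

```python
import numpy as np

rows = 10_000_000  # hypothetical predicted merge size
cols = 25          # columns in df1 + columns in df2, minus the shared key
itemsize = np.dtype(np.float64).itemsize  # 8 bytes

required = rows * cols * itemsize
print(required)  # 2000000000 bytes, i.e. roughly 2 GB
```

So a merge predicted at ten million rows and 25 columns would need about 2 GB, which mem_fit would compare against the memory currently available.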
The merge_size method has been proposed as an extension to pandas in this issue: https://github.com/pandas-dev/pandas/issues/15068.