Pandas Merge Error: MemoryError

Question

I'm trying to merge two relatively small datasets, but the merge raises a MemoryError. I have two datasets of aggregated country trade data that I'm trying to merge on the keys year and country, so the rows have to be matched up on those keys rather than simply stacked. Unfortunately, this rules out concat and its performance benefits, as seen in the answer to this question: MemoryError on large merges with pandas in Python.

Here's the setup:

The attempted merge:

df = merge(df, i, left_on=['year', 'ComTrade_CC'], right_on=["Year","Partner Code"])

Basic data structure:

i:

    Year    Reporter_Code   Trade_Flow_Code Partner_Code    Classification  Commodity Code  Quantity Unit Code  Supplementary Quantity  Netweight (kg)  Value   Estimation Code
0    2003    381     2   36  H2  070951  8   1274    1274    13810   0
1    2003    381     2   36  H2  070930  8   17150   17150   30626   0
2    2003    381     2   36  H2  0709    8   20493   20493   635840  0
3    2003    381     1   36  H2  0507    8   5200    5200    27619   0
4    2003    381     1   36  H2  050400  8   56439   56439   683104  0

df:

    mporter  cod     CC ComTrade_CC Distance_miles
0    110     215     215     757     428.989
1    110     215     215     757     428.989
2    110     215     215     757     428.989
3    110     215     215     757     428.989
4    110     215     215     757     428.989

Error traceback:

 MemoryError                      Traceback (most recent call last)
<ipython-input-10-8d6e9fb45de6> in <module>()
      1 for i in c_list:
----> 2     df = merge(df, i, left_on=['year', 'ComTrade_CC'], right_on=["Year","Partner Code"])

/usr/local/lib/python2.7/dist-packages/pandas-0.12.0rc1_309_g9fc8636-py2.7-linux-x86_64.egg/pandas/tools/merge.pyc in merge(left, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy)
     36                          right_index=right_index, sort=sort, suffixes=suffixes,
     37                          copy=copy)
---> 38     return op.get_result()
     39 if __debug__:
     40     merge.__doc__ = _merge_doc % '\nleft : DataFrame'

/usr/local/lib/python2.7/dist-packages/pandas-0.12.0rc1_309_g9fc8636-py2.7-linux-x86_64.egg/pandas/tools/merge.pyc in get_result(self)
    193                                       copy=self.copy)
    194 
--> 195         result_data = join_op.get_result()
    196         result = DataFrame(result_data)
    197 

/usr/local/lib/python2.7/dist-packages/pandas-0.12.0rc1_309_g9fc8636-py2.7-linux-x86_64.egg/pandas/tools/merge.pyc in get_result(self)
    693                 if klass in mapping:
    694                     klass_blocks.extend((unit, b) for b in mapping[klass])
--> 695             res_blk = self._get_merged_block(klass_blocks)
    696 
    697             # if we have a unique result index, need to clear the _ref_locs

/usr/local/lib/python2.7/dist-packages/pandas-0.12.0rc1_309_g9fc8636-py2.7-linux-x86_64.egg/pandas/tools/merge.pyc in _get_merged_block(self, to_merge)
    706     def _get_merged_block(self, to_merge):
    707         if len(to_merge) > 1:
--> 708             return self._merge_blocks(to_merge)
    709         else:
    710             unit, block = to_merge[0]

/usr/local/lib/python2.7/dist-packages/pandas-0.12.0rc1_309_g9fc8636-py2.7-linux-x86_64.egg/pandas/tools/merge.pyc in _merge_blocks(self, merge_chunks)
    728         # Should use Fortran order??
    729         block_dtype = _get_block_dtype([x[1] for x in merge_chunks])
--> 730         out = np.empty(out_shape, dtype=block_dtype)
    731 
    732         sofar = 0

MemoryError: 

Thanks for your thoughts!

Recommended answer

In case anyone coming across this question still has similar trouble with merge, you can probably get concat to work by renaming the relevant columns in the two dataframes to the same names, setting them as a MultiIndex (i.e. df = dv.set_index(['A','B'])), and then using concat to join them.
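A minimal sketch of that suggestion, not the asker's actual data: the two small frames below are hypothetical stand-ins, and the shared key name `country` is made up for illustration. It also assumes each (year, country) pair appears at most once per frame, since concat along the columns aligns on the index and generally needs unique index values to do so; frames with repeated keys would have to be aggregated or de-duplicated first.

```python
import pandas as pd

# Hypothetical stand-ins for the two frames in the question.
left = pd.DataFrame({
    "year": [2003, 2003, 2004],
    "ComTrade_CC": [36, 757, 36],
    "Distance_miles": [428.989, 512.5, 428.989],
})
right = pd.DataFrame({
    "Year": [2003, 2004, 2004],
    "Partner Code": [36, 36, 757],
    "Value": [13810, 30626, 635840],
})

# Rename the key columns to the same names in both frames
# ("country" is an illustrative name, not from the original data).
keys = ["year", "country"]
left = left.rename(columns={"ComTrade_CC": "country"}).set_index(keys)
right = right.rename(
    columns={"Year": "year", "Partner Code": "country"}
).set_index(keys)

# Align the two frames on the shared MultiIndex; join="inner" keeps
# only the key pairs present in both frames.
joined = pd.concat([left, right], axis=1, join="inner").sort_index()
print(joined)
```

With `join="outer"` (the default) the key pairs missing from one side are kept and filled with NaN instead, which is closer to an outer merge.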
