pandas: merge on column of collections.Counter (or even just dict) objects?

Question

I need to perform a merge of two pandas DataFrames using columns with collections.Counter objects (https://docs.python.org/2/library/collections.html#collections.Counter). The merge raises a weird error. See executable code example below.

import pandas as pd
from collections import Counter
a = pd.DataFrame([(120000.0, 120000.0, 0.0, 120000.0),
 (120000.0, 280000.0, 120000.0, 120000.0),
 (280000.0, 280000.0, 120000.0, 280000.0),
 (280000.0, 420000.0, 280000.0, 280000.0),
 (420000.0, 420000.0, 280000.0, 420000.0),
 (420000.0, 500000.0, 420000.0, 420000.0),
 (500000.0, 580000.0, 420000.0, 500000.0),
 (580000.0, 820000.0, 500000.0, 580000.0),
 (820000.0, 860000.0, 580000.0, 820000.0),
 (860000.0, 1160000.0, 820000.0, 860000.0),
 (1160000.0, 1160000.0, 860000.0, 1160000.0)])
b = pd.DataFrame([(120000.0, 120000.0, 0.0, 120000.0),
 (120000.0, 280000.0, 120000.0, 120000.0),
 (280000.0, 280000.0, 120000.0, 280000.0),
 (280000.0, 440000.0, 280000.0, 280000.0),
 (440000.0, 440000.0, 280000.0, 440000.0),
 (440000.0, 520000.0, 440000.0, 440000.0),
 (520000.0, 580000.0, 440000.0, 520000.0),
 (580000.0, 820000.0, 520000.0, 580000.0),
 (820000.0, 860000.0, 580000.0, 820000.0),
 (860000.0, 1120000.0, 820000.0, 860000.0),
 (1120000.0, 1160000.0, 860000.0, 1120000.0)])
a['ID'] = [Counter(i) for i in list(a.values)]
b['ID'] = [Counter(i) for i in list(b.values)]
pd.merge(a, b, on='ID')

This returns:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/spyderlib/widgets/externalshell/sitecustomize.py", line 601, in runfile
    execfile(filename, namespace)
  File "/usr/local/lib/python2.7/dist-packages/spyderlib/widgets/externalshell/sitecustomize.py", line 73, in execfile
    builtins.execfile(filename, *where)
  File "/home/ilya/tmp/tmp_merge.py", line 33, in <module>
    pd.merge(a, b, on='ID')
  File "/usr/local/lib/python2.7/dist-packages/pandas/tools/merge.py", line 38, in merge
    return op.get_result()
  File "/usr/local/lib/python2.7/dist-packages/pandas/tools/merge.py", line 186, in get_result
    join_index, left_indexer, right_indexer = self._get_join_info()
  File "/usr/local/lib/python2.7/dist-packages/pandas/tools/merge.py", line 273, in _get_join_info
    sort=self.sort, how=self.how)
  File "/usr/local/lib/python2.7/dist-packages/pandas/tools/merge.py", line 461, in _get_join_indexers
    llab, rlab, shape = map(list, zip( * map(fkeys, left_keys, right_keys)))
TypeError: type object argument after * must be a sequence, not itertools.imap

I tried converting the Counter objects to plain dicts, i.e.

b['ID'] = [dict(Counter(i)) for i in list(b.values)]

but it didn't help. Is this normal behaviour? If so, how do I circumvent this error? Or is there another way to achieve the same end result?

I use Python 2.7 and pandas 0.16.1 (normally in an IPython notebook, but this was also tested in plain Python).

To clarify what all this is for: I need to merge based on the values of two pairs of columns. In the real data they are Start1, End1, Start2, End2, with End1 > Start1 and End2 > Start2. The example uses a subset of my real values. The problem is that the two datasets may contain rows where (Start1_1, End1_1) == (Start2_2, End2_2) and (Start1_2, End1_2) == (Start2_1, End2_1), where the trailing number denotes the dataset; I want these rows to be merged as well. I thought using such Counters would be the easiest solution, and I am pretty sure there will be no false positives this way.
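The matching rule described above (two rows match when they hold the same two (start, end) pairs, in either order) can also be written directly as an order-independent key. The following is only a minimal sketch under stated assumptions: the column names follow the description, the values are made up, and `pair_key` is a hypothetical helper. It also shows that a Counter over the four raw values is insensitive to how the values are paired, so a pair-based key may be the stricter choice.

```python
from collections import Counter

import pandas as pd

# Column names follow the description above; the values are made up.
cols = ["Start1", "End1", "Start2", "End2"]
left = pd.DataFrame([(10, 20, 30, 40), (50, 60, 70, 80)], columns=cols)
right = pd.DataFrame([(30, 40, 10, 20), (50, 70, 60, 80)], columns=cols)


def pair_key(start1, end1, start2, end2):
    """Hypothetical helper: both (start, end) pairs, sorted so their order is irrelevant."""
    return tuple(sorted([(start1, end1), (start2, end2)]))


# Build a hashable key per row from the four columns.
left["key"] = [pair_key(*row) for row in left[cols].itertuples(index=False)]
right["key"] = [pair_key(*row) for row in right[cols].itertuples(index=False)]

merged = pd.merge(left, right, on="key", suffixes=("_1", "_2"))

# A Counter over the raw values treats the second rows as equal,
# even though their (start, end) pairs differ.
counters_equal = Counter((50, 60, 70, 80)) == Counter((50, 70, 60, 80))
```

Here only the first rows match under the pair-based key, whereas the Counter comparison would also equate the second rows.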

Recommended answer

One way around this is to add a column to each DataFrame that holds a version of your original data structure converted to a hashable type.

For example,

# .items() works on both Python 2 and 3 (.iteritems() exists only on Python 2)
a['IDHash'] = a.ID.apply(lambda r: tuple(sorted(r.items())))
b['IDHash'] = b.ID.apply(lambda r: tuple(sorted(r.items())))

Then

pd.merge(a, b, on='IDHash')

After the merge, just drop the helper columns.
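As an end-to-end illustration of the same idea, here is a minimal sketch that builds the hashable key, merges on it, and drops the helper column. The frames, the column names `x` and `y`, and the helper `hashable_key` are assumptions made for this example, not the asker's data; `drop(columns=...)` assumes a reasonably recent pandas.

```python
from collections import Counter

import pandas as pd

# Small illustrative frames (made-up values, not the asker's data).
a = pd.DataFrame({"x": [1.0, 2.0, 5.0], "y": [2.0, 3.0, 5.0]})
b = pd.DataFrame({"x": [2.0, 3.0, 7.0], "y": [1.0, 2.0, 8.0]})


def hashable_key(counter):
    """Turn a Counter (or dict) into a sorted tuple of (value, count) pairs."""
    return tuple(sorted(counter.items()))


# Build a Counter per row and convert it to a hashable key right away.
a["IDHash"] = [hashable_key(Counter(row)) for row in a[["x", "y"]].values.tolist()]
b["IDHash"] = [hashable_key(Counter(row)) for row in b[["x", "y"]].values.tolist()]

merged = pd.merge(a, b, on="IDHash", suffixes=("_a", "_b"))

# Drop the helper column once the merge is done.
merged = merged.drop(columns=["IDHash"])
```

Sorting the items before building the tuple matters: two Counters with the same contents but different insertion order should yield the same key.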
