pandas efficiently compress columns into column with lists of tuples

Problem description

I have a DataFrame representing groups of exchanges between account holders. The data shows the interacting accounts and the items exchanged. Sometimes there is a clear match, but sometimes only the totals of items exchanged match and you can't tell exactly what amount was exchanged between which individuals.

The desired input and output look like this:

  id group   rx   tx
0  A     x   50    0
1  B     x    0   50
2  A     y  210    0
3  B     y    0   50
4  C     y    0  350
5  D     y  190    0
  group                                          exchanges
0     x                                       [(B, A, 50)]
1     y  [(unk, A, 210), (B, unk, 50), (C, unk, 350), (unk, D, 190)]
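
For reference, a minimal sketch that builds this sample input; the frame is named a to match the question's later snippets, and these imports are assumed throughout:

import numpy as np
import pandas as pd

# Sample input: one row per account per group, with received (rx)
# and transmitted (tx) amounts.
a = pd.DataFrame({
    'id':    ['A', 'B', 'A', 'B', 'C', 'D'],
    'group': ['x', 'x', 'y', 'y', 'y', 'y'],
    'rx':    [50, 0, 210, 0, 0, 190],
    'tx':    [0, 50, 0, 50, 350, 0],
})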

Currently I'm using groupby and apply like this:

def sort_out(x):
    # Create the row to be returned for this group.
    y = pd.Series(index=['group', 'exchanges'], dtype=object)
    y['group'] = x.group.iloc[0]
    y['exchanges'] = []

    # Build (index, id, amount) tuples: receivers go into sink,
    # senders into source.
    sink = [tuple(i) for i in
            x.loc[x['rx'] != 0][['id', 'rx']].to_records(index=True)]
    source = [tuple(i) for i in
              x.loc[x['tx'] != 0][['id', 'tx']].to_records(index=True)]

    # Pair each sender with a receiver of the same amount, if one exists.
    for item in source:
        match = [o for o in sink if o[2] == item[2]]
        if match:
            y['exchanges'].append((item[1], match[0][1], match[0][2]))
            sink.remove(match[0])

    # Handle the unmatched elements: amounts that appear on only one side.
    tx_el = x.loc[~x['tx'].isin(x['rx'])][['id', 'tx']].to_records(index=True)
    rx_el = x.loc[~x['rx'].isin(x['tx'])][['id', 'rx']].to_records(index=True)

    for item in tx_el:
        y['exchanges'].append((item[1], 'unk', item[2]))
    for item in rx_el:
        y['exchanges'].append(('unk', item[1], item[2]))

    return y

b = a.groupby('group').apply(lambda x: sort_out(x))
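
A small usage note: the lambda wrapper is equivalent to passing sort_out directly, and the per-group results come back indexed by group, which can be flattened if a plain frame is wanted. A minimal sketch, assuming the sample frame a above:

# Equivalent call without the lambda; reset_index(drop=True) discards the
# group index (the 'group' column is already filled inside sort_out).
b = a.groupby('group').apply(sort_out).reset_index(drop=True)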

This approach takes at best 7 hours on ~20 million rows. I think the big hurdle is the groupby-apply. I was recently introduced to explode, and from there I looked at melt, but that doesn't seem to be what I'm looking for. Any suggestions for improvements?

[Another attempt]

Based on YOBEN_S's suggestion I tried the following. Part of the challenge is the matching; part is keeping track of which account is transmitting (tx) and which is receiving (rx). So I cheat by explicitly adding a tag, i.e. a direction column ('dir'). I also use a nested ternary, but I'm not sure that's very performant:

# Tag the direction before sorting: 't' = transmitting, 'r' = receiving.
a['dir'] = a.apply(lambda x: 't' if x['tx'] != 0 else 'r', axis=1)
# After the row-wise sort, 'tx' always holds the exchanged amount.
a[['rx', 'tx']] = np.sort(a[['rx', 'tx']].values, axis=1)

out = (a.drop(['group', 'rx'], 1).apply(tuple, 1)
        .groupby([a['group'], a.tx]).agg('sum')
        # A matched pair concatenates to a 6-tuple; lone rows stay 3-tuples.
        .apply(lambda x: (x[3], x[0], x[1]) if len(x) == 6 else
               ((x[0], 'unk', x[1]) if x[2] == 't' else ('unk', x[0], x[1])))
        .groupby(level=0).agg(list))
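
For what it's worth, on the sample frame above this appears to reproduce the desired exchanges, though the ordering inside each list can differ because the intermediate groupby sorts by amount, and the matched-pair branch (len(x) == 6) seems to rely on the receiving row appearing before the transmitting row within the pair. A minimal check:

# Sanity check against the desired output from the question
# (order within each list may differ).
print(out)
# Expected per group:  x -> [(B, A, 50)]
#                      y -> [(unk, A, 210), (B, unk, 50),
#                            (C, unk, 350), (unk, D, 190)]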


Recommended answer

We can try:

out = (df.drop('group', 1).apply(tuple, 1).groupby(df['group'])
         .agg(list).to_frame('exchange').reset_index())
  group                                           exchange
0     x                           [(A, 50, 0), (B, 0, 50)]
1     y  [(A, 210, 0), (B, 0, 50), (C, 0, 350), (D, 190...
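
The answer operates on a frame called df; presumably this is the question's input frame. A minimal bridge, assuming the sample frame a built earlier, so the snippets can be run as-is:

# Assumption: the answer's `df` is the question's input frame `a`.
df = a.copy()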

Update

df[['rx', 'tx']] = np.sort(df[['rx', 'tx']].values, axis=1)
out = (df.drop(['group', 'rx'], 1).apply(list, 1)
         .groupby([df['group'], df.tx]).agg('sum')
         .apply(set).groupby(level=0).agg(list))
out
group
x                               [{50, A, B}]
y    [{50, B}, {D, 190}, {210, A}, {C, 350}]
dtype: object
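
If a frame shaped like the desired output is preferred over a Series, the grouped result can be converted; a minimal sketch:

# Turn the grouped Series into a two-column frame (group, exchanges).
result = out.to_frame('exchanges').reset_index()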
