Find all matching groups in a list of lists with pandas


Problem description

I would like to find, for every id in a pandas DataFrame, which cases it satisfies. A case is satisfied when an id contains all of the names in that case. What would be an efficient solution? I have around 10k records, and they are processed server-side. Would it be a good idea to create a new DataFrame, or is there a more efficient data structure I can use?

Input (pandas DataFrame)

id | name |
-----------
1  | bla1 |
2  | bla2 |
2  | bla3 |
2  | bla4 |
3  | bla5 |
4  | bla9 |
5  | bla6 |
5  | bla7 |
6  | bla8 |
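
To make the example reproducible, the input table above could be built roughly as follows. This is only a sketch: the variable name df is chosen to match the answer's code, and the ids are assumed to be integers.

```python
import pandas as pd

# Input table from the question: one row per (id, name) pair.
# Assumption: ids are integers and names are plain strings.
df = pd.DataFrame({
    'id':   [1, 2, 2, 2, 3, 4, 5, 5, 6],
    'name': ['bla1', 'bla2', 'bla3', 'bla4', 'bla5',
             'bla9', 'bla6', 'bla7', 'bla8'],
})

print(df)
```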

Cases

names [
  [bla2, bla3, bla4], #case 1
  [bla1, bla3, bla7], #case 2
  [bla3, bla1, bla6], #case 3
  [bla6, bla7] #case 4
]

Desired output (unless there is a more efficient way)

id | case1 | case2 | case3 | case4 |
------------------------------------
1  | 0     | 0     | 0     | 0     |
2  | 1     | 0     | 0     | 0     |
3  | 0     | 0     | 0     | 0     |
4  | 0     | 0     | 0     | 0     |
5  | 0     | 0     | 0     | 1     |
6  | 0     | 0     | 0     | 0     |

Recommended answer

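import pandas as pd

names = [
   ['bla2', 'bla3', 'bla4'], # case 1
   ['bla1', 'bla3', 'bla7'], # case 2
   ['bla3', 'bla1', 'bla6'], # case 3
   ['bla6', 'bla7']          # case 4
]

# df is the input DataFrame with columns 'id' and 'name'.
# For each id group, build one 0/1 flag per case:
# 1 when every name in the case appears among the group's names.
df = df.groupby('id').apply(lambda x: \
                pd.Series([int(pd.Series(y).isin(x['name']).all()) for y in names]))\
       .rename(columns=lambda x: 'case{}'.format(x + 1))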

df
+------+---------+---------+---------+---------+
|   id |   case1 |   case2 |   case3 |   case4 |
|------+---------+---------+---------+---------|
|    1 |       0 |       0 |       0 |       0 |
|    2 |       1 |       0 |       0 |       0 |
|    3 |       0 |       0 |       0 |       0 |
|    4 |       0 |       0 |       0 |       0 |
|    5 |       0 |       0 |       0 |       1 |
|    6 |       0 |       0 |       0 |       0 |
+------+---------+---------+---------+---------+

First, group by id, and then, for each group, check each case in turn. The objective is to check whether every name in a given case appears among the names in the group. This is handled by isin together with all() inside the list comprehension. The outer pd.Series expands the per-case results into separate columns, and rename is used to give the columns their case1, case2, ... labels.
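
Since the question also asks about efficiency, here is a minimal alternative sketch that uses plain Python sets instead of building a pandas Series per case per group. It is not part of the original answer and has not been benchmarked. The names names_by_id and result are illustrative, and the input data is rebuilt here so the snippet runs on its own.

```python
import pandas as pd

# Same input and cases as in the question.
df = pd.DataFrame({
    'id':   [1, 2, 2, 2, 3, 4, 5, 5, 6],
    'name': ['bla1', 'bla2', 'bla3', 'bla4', 'bla5',
             'bla9', 'bla6', 'bla7', 'bla8'],
})
names = [
    ['bla2', 'bla3', 'bla4'],  # case 1
    ['bla1', 'bla3', 'bla7'],  # case 2
    ['bla3', 'bla1', 'bla6'],  # case 3
    ['bla6', 'bla7'],          # case 4
]

# Collect the names of each id into a set, once per id.
names_by_id = df.groupby('id')['name'].apply(set)

# A case is satisfied when its names form a subset of the id's names.
result = pd.DataFrame(
    {
        'case{}'.format(i + 1): [int(set(case) <= s) for s in names_by_id]
        for i, case in enumerate(names)
    },
    index=names_by_id.index,
)

print(result)
```

With the question's data, only id 2 (case1) and id 5 (case4) are flagged with 1. Whether this is actually faster than the groupby/apply version on 10k records would need to be measured.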
