Set Union in pandas

Question

I have two columns in my dataframe that store sets.

I want to compute the set union of the two columns with a fast vectorized operation:

df['union'] = df.set1 | df.set2

However, this raises TypeError: unsupported operand type(s) for |: 'set' and 'bool', because both columns also contain np.nan values.

Is there a good way to work around this?

Recommended answer

For these operations, pure Python may be more efficient than DataFrame.apply:

%timeit pd.Series([set1.union(set2) for set1, set2 in zip(df['A'], df['B'])])
10 loops, best of 3: 43.3 ms per loop

%timeit df.apply(lambda x: x.A.union(x.B), axis=1)
1 loop, best of 3: 2.6 s per loop
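
The timings above assume every cell holds a set, while the question's columns also contain np.nan. A minimal sketch of the same list-comprehension approach, adapted to that case, might first map non-set cells to an empty set. The helper name as_set and the choice to treat NaN as an empty set are assumptions made for illustration, not part of the original answer.

```python
import numpy as np
import pandas as pd


def as_set(value):
    """Return value unchanged if it is a set; otherwise an empty set.

    Treating NaN (or any other non-set cell) as an empty set is an
    assumption of this sketch, not something the original answer states.
    """
    return value if isinstance(value, (set, frozenset)) else set()


# Small example frame using the question's column names, with NaN in both columns.
df = pd.DataFrame({
    'set1': [{'a', 'b'}, np.nan, {'c'}, np.nan],
    'set2': [{'b', 'c'}, {'d'}, np.nan, np.nan],
})

# Row-wise union in plain Python, as in the timed list comprehension.
df['union'] = [as_set(a) | as_set(b) for a, b in zip(df['set1'], df['set2'])]
```

If a missing value should stay missing instead of becoming an empty set, the helper would need a different policy.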

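If + could be used on the columns (which would require a set subclass that defines it as union), the union would probably take about half the time, as the timings for - (set difference) suggest. Subclassing may not be worth it, though: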

%timeit df['A'] - df['B']
10 loops, best of 3: 22.1 ms per loop

%timeit pd.Series([set1.difference(set2) for set1, set2 in zip(df['A'], df['B'])])
10 loops, best of 3: 35.7 ms per loop
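
For illustration only, the inheritance idea could look like the sketch below. It uses a hypothetical set subclass, UnionSet (a name invented here, not taken from the original answer), that defines __add__ as union, so that + on two object-dtype columns is applied element by element. Whether this is actually faster than the list comprehension was not measured here.

```python
import pandas as pd


class UnionSet(set):
    """Hypothetical set subclass whose + operator means set union."""

    def __add__(self, other):
        # Wrap the result so chained additions keep working.
        return UnionSet(self | other)


s1 = pd.Series([UnionSet({1, 2}), UnionSet({4})])
s2 = pd.Series([UnionSet({2, 3}), UnionSet()])

# On object-dtype Series, + calls UnionSet.__add__ for each pair of elements.
result = s1 + s2
```

Every cell has to be converted to the subclass first, and that conversion cost may cancel out any gain.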


DataFrame for timings:

import pandas as pd
import numpy as np
l1 = [set(np.random.choice(list('abcdefg'), np.random.randint(1, 5))) for _ in range(100000)]
l2 = [set(np.random.choice(list('abcdefg'), np.random.randint(1, 5))) for _ in range(100000)]

df = pd.DataFrame({'A': l1, 'B': l2})
