基于列表列加入两个 pandas 数据框 [英] Join two pandas dataframes based on lists columns
问题描述
我有 2 个数据框,其中包含列表列.
我想根据列表中的 2+ 共享价值观加入他们.示例:
I have 2 dataframes containing columns of lists.
I would like to join them based on 2+ shared values on the lists. Example:
ColumnA ColumnB | ColumnA ColumnB
id1 ['a','b','c'] | id3 ['a','b','c','x','y', 'z']
id2 ['a','d,'e'] |
在这种情况下,我们可以看到 id1 匹配 id3,因为列表中有 2 个以上的共享值.所以输出将是(列名并不重要,只是举例):
In this case we can see that id1 matches id3 because there are 2+ shared values on the lists. So the output will be (columns name are not important and just for example):
ColumnA1 ColumnB1 ColumnA2 ColumnB2
id1 ['a','b','c'] id3 ['a','b','c','x','y', 'z']
我怎样才能达到这个结果?我试图迭代数据帧 #1 中的每一行,但这似乎不是一个好主意.
谢谢!
How can I achieve this result? I've tried to iterate each row in dataframe #1 but it doesn't seem a good idea.
Thank you!
推荐答案
如果您使用的是 pandas 1.2.0 或更高版本(2020 年 12 月 26 日发布),笛卡尔积(交叉关节)可以简化如下:
If you are using pandas 1.2.0 or newer (released on December 26, 2020), cartesian product (cross joint) can be simplified as follows:
df = df1.merge(df2, how='cross') # simplified cross joint for pandas >= 1.2.0
另外,如果您关心系统性能(执行时间),建议使用 list(map...
而不是较慢的 apply(...axis=1)
Also, if system performance (execution time) is a concern to you, it is advisable to use list(map...
instead of the slower apply(... axis=1)
使用apply(...axis=1)
:
%%timeit
df['overlap'] = df.apply(lambda x:
len(set(x['ColumnB1']).intersection(
set(x['ColumnB2']))), axis=1)
800 µs ± 59.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
同时使用 list(map(...
:
%%timeit
df['overlap'] = list(map(lambda x, y: len(set(x).intersection(set(y))), df['ColumnB1'], df['ColumnB2']))
217 µs ± 25.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
请注意,使用 list(map...
快 3 倍!
Notice that using list(map...
is 3x times faster!
整套代码供您参考:
data = {'ColumnA1': ['id1', 'id2'], 'ColumnB1': [['a', 'b', 'c'], ['a', 'd', 'e']]}
df1 = pd.DataFrame(data)
data = {'ColumnA2': ['id3', 'id4'], 'ColumnB2': [['a','b','c','x','y', 'z'], ['d','e','f','p','q', 'r']]}
df2 = pd.DataFrame(data)
df = df1.merge(df2, how='cross') # for pandas version >= 1.2.0
df['overlap'] = list(map(lambda x, y: len(set(x).intersection(set(y))), df['ColumnB1'], df['ColumnB2']))
df = df[df['overlap'] >= 2]
print (df)
这篇关于基于列表列加入两个 pandas 数据框的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!