How to find intersection of dataframes based on multiple columns?

Problem description

I have two dataframes as below. I'm trying to find the intersection of the two dataframes where a row counts as a match if either of the two columns matches, not only when both of them do.

So in this case, I want to return dataframe C, which has df A row 1 (as A row 1 Col1 = row 1 Col1 in B), df A row 2 (as A row 2 Col2 = row 1 Col2 in B), df A row 4 (as Col1 row 2 in B = Col1 row 4 in A), and row 5 in A. But if I do an intersect of A and B, it will only return row 5 in A, as that is the only row that matches on both columns. How do I do this? Many thanks. Let me know if I'm not explaining the question very well.

A:

    Col1    Col2
    1       2
    2       3
    3       7
    5       4
    1       3

B:

    Col1    Col2
    1       3
    5       1

C:

    Col1    Col2
    1       2
    2       3
    5       4
    1       3

Recommended answer

With the following data:

val df1 = sc.parallelize(Seq(1->2, 2->3, 3->7, 5->4, 1->3)).toDF("col1", "col2")
val df2 = sc.parallelize(Seq(1->3, 5->1)).toDF("col1", "col2")

Then you can join your datasets with an OR condition:

val cols = df1.columns
df1.join(df2, cols.map(c => df1(c) === df2(c)).reduce(_ || _) )
   .select(cols.map(df1(_)) :_*)
   .distinct
   .show

+----+----+
|col1|col2|
+----+----+
|   2|   3|
|   1|   2|
|   1|   3|
|   5|   4|
+----+----+

The join condition is generic and works for any number of columns. The expression cols.map(c => df1(c) === df2(c)) maps each column to an equality between that column in df1 and the same column in df2. The reduce then takes the logical OR of all these equalities, which is what you want. The select is there because the join would otherwise keep the columns of both dataframes; here I simply keep the ones from df1. I also added a distinct in case several rows of df2 match a row of df1 or vice versa. Indeed, you may get a cartesian product.
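As a possible variation on the join above (not part of the original answer), the same OR condition can be passed to a left semi join. A semi join returns only the columns of df1 and emits each df1 row at most once, however many df2 rows it matches, so the select and distinct are not needed. This is a minimal sketch that assumes Spark 2.x or later and a local SparkSession; the names anyColumnMatches and semiJoined are made up for illustration. Note one difference in behaviour: unlike distinct, a semi join keeps duplicate rows that already exist in df1.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for experimentation; in spark-shell, `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("semi-join-sketch").getOrCreate()
import spark.implicits._

val df1 = Seq(1 -> 2, 2 -> 3, 3 -> 7, 5 -> 4, 1 -> 3).toDF("col1", "col2")
val df2 = Seq(1 -> 3, 5 -> 1).toDF("col1", "col2")

val cols = df1.columns

// One equality per column, combined with a logical OR (same condition as in the answer).
val anyColumnMatches = cols.map(c => df1(c) === df2(c)).reduce(_ || _)

// "left_semi" keeps the df1 rows that have at least one matching df2 row,
// and returns only df1's columns.
val semiJoined = df1.join(df2, anyColumnMatches, "left_semi")
semiJoined.show()
```

On the sample data this should keep the same four rows as the select/distinct version, although the row order is not guaranteed.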

Note that this method does not need any collection to the driver, so it works regardless of the size of the datasets. Yet, if df2 is small enough to be collected to the driver and broadcast, you would get faster results with a method like this:

import org.apache.spark.sql.functions.{lit, udf}

// Map each column name to the set of values that column takes in df2.
val valueMap = df2.rdd
    .flatMap(row => cols.map(name => name -> row.getAs[Any](name)))
    .distinct
    .groupByKey
    .mapValues(_.toSet)
    .collectAsMap

// Create a udf that checks whether a value appears in valueMap for the given column name.
val filter = udf((name : String, value : Any) => 
                     valueMap(name).contains(value))

// Finally, keep the rows of df1 where at least one column matches.
df1.where( cols.map(c => filter(lit(c), df1(c))).reduce(_||_))
   .show

With this method, there is no shuffling of df1 and no cartesian product. If df2 is small, this is definitely the way to go.
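The same idea can also be sketched without a udf, using the built-in Column.isin. This is an alternative to the code above rather than the answer's own method, and it rests on the same assumption that df2 is small enough to collect to the driver. The names valuesByColumn and matched are made up for illustration. Each df2 column is collected separately here, which costs one small Spark job per column.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for experimentation; in spark-shell, `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("isin-sketch").getOrCreate()
import spark.implicits._

val df1 = Seq(1 -> 2, 2 -> 3, 3 -> 7, 5 -> 4, 1 -> 3).toDF("col1", "col2")
val df2 = Seq(1 -> 3, 5 -> 1).toDF("col1", "col2")
val cols = df1.columns

// Collect the distinct values of each df2 column to the driver (assumes df2 is small).
val valuesByColumn: Map[String, Seq[Any]] = cols.map { c =>
  c -> df2.select(c).distinct().collect().map(_.get(0)).toSeq
}.toMap

// Keep a df1 row when at least one of its columns holds a value
// that appears in the same column of df2.
val matched = df1.where(cols.map(c => df1(c).isin(valuesByColumn(c): _*)).reduce(_ || _))
matched.show()
```

Because the filter is built from built-in expressions rather than a udf, Spark can optimize it, but very long value lists make the isin expression itself expensive, so this only suits a genuinely small df2.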
