Spark: Join dataframe column with an array


Question

I have two DataFrames with two columns:

  • df1 has schema (key1: Long, Value)
  • df2 has schema (key2: Array[Long], Value)

    I need to join these DataFrames on the key columns (find matching values between key1 and the values in key2). But the problem is that they do not have the same type. Is there a way to do this?

    Answer

    You can cast key1 and key2 to strings and then use the contains function, as follows.

    val df1 = sc.parallelize(Seq((1L,"one.df1"), 
                                 (2L,"two.df1"),      
                                 (3L,"three.df1"))).toDF("key1","Value")  
    
    DF1:
    +----+---------+
    |key1|Value    |
    +----+---------+
    |1   |one.df1  |
    |2   |two.df1  |
    |3   |three.df1|
    +----+---------+
    
    val df2 = sc.parallelize(Seq((Array(1L,1L),"one.df2"),
                                 (Array(2L,2L),"two.df2"),
                                 (Array(3L,3L),"three.df2"))).toDF("key2","Value")
    DF2:
    +------+---------+
    |key2  |Value    |
    +------+---------+
    |[1, 1]|one.df2  |
    |[2, 2]|two.df2  |
    |[3, 3]|three.df2|
    +------+---------+
    
    import org.apache.spark.sql.functions.col

    val joinedDF = df1.join(df2, col("key2").cast("string").contains(col("key1").cast("string")))
    
    JOIN:
    +----+---------+------+---------+
    |key1|Value    |key2  |Value    |
    +----+---------+------+---------+
    |1   |one.df1  |[1, 1]|one.df2  |
    |2   |two.df1  |[2, 2]|two.df2  |
    |3   |three.df1|[3, 3]|three.df2|
    +----+---------+------+---------+
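    A note on correctness: because the join condition compares string representations, key1 = 1 would also match a row whose key2 is [11, 2], since the string "[11, 2]" contains "1". If exact element matching is needed, a sketch using Spark's built-in array_contains function avoids this (assuming a Spark version where the value argument may be a column, as in 2.4+):

    ```scala
    import org.apache.spark.sql.functions.{array_contains, col}

    // Matches only when key1 appears as an exact element of key2,
    // so key1 = 1 no longer matches key2 = [11, 2].
    val joinedExact = df1.join(df2, array_contains(col("key2"), col("key1")))
    ```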
    

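    Also note that the contains-based join above is a non-equi join, which Spark typically executes as a broadcast nested loop over all row pairs. For large inputs, one possible approach (a sketch, not part of the original answer; joinedFlat and df2Exploded are illustrative names) is to explode key2 into one row per element and use a plain equi-join:

    ```scala
    import org.apache.spark.sql.functions.{col, explode}

    // One row per array element of key2, then a regular equi-join,
    // which Spark can run as a hash or sort-merge join.
    val df2Exploded = df2.withColumn("key", explode(col("key2")))
    val joinedFlat  = df1.join(df2Exploded, col("key1") === col("key"))
    ```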