Joining two dataframes without a common column
Question
I have two dataframes with different sets of columns, and I need to join them. Please refer to the example below.
val df1 has
Customer_name
Customer_phone
Customer_age
val df2 has
Order_name
Order_ID
These two dataframes don't have any common column, and the number of rows and the number of columns also differ between them. I tried to insert a new dummy column carrying an increasing row index, like this: val dfr = df1.withColumn("row_index", monotonically_increasing_id()).
But as I am using Spark 2, the monotonically_increasing_id method is not supported. Is there any way to join the two dataframes, so that I can write the values of both into a single sheet of an Excel file?
For example:
val df1:
Customer_name Customer_phone Customer_age
karti 9685684551 24
raja 8595456552 22
val df2:
Order_name Order_ID
watch 1
cattoy 2
My final Excel sheet should look like this:
Customer_name Customer_phone Customer_age Order_name Order_ID
karti 9685684551 24 watch 1
raja 8595456552 22 cattoy 2
Answer
monotonically_increasing_id() is guaranteed to be increasing and unique, but not consecutive, so it cannot be used to pair rows by position across two dataframes.
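Consecutive indices are what make a positional join possible. As a minimal plain-Scala sketch (sample names taken from the question), Scala's zipWithIndex assigns consecutive, zero-based indices, which is the same guarantee RDD.zipWithIndex provides:

```scala
object ZipDemo {
  def main(args: Array[String]): Unit = {
    // zipWithIndex pairs each element with a consecutive, 0-based index.
    val indexed = Seq("karti", "raja").zipWithIndex
    println(indexed) // prints List((karti,0), (raja,1))
  }
}
```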
Instead, you can use zipWithIndex by converting each DataFrame to an rdd and then reconstructing a DataFrame with the same schema plus an index column, for both dataframes.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StructField, StructType}
import spark.implicits._
val df1 = Seq(
  ("karti", "9685684551", 24),
  ("raja", "8595456552", 22)
).toDF("Customer_name", "Customer_phone", "Customer_age")

val df2 = Seq(
  ("watch", 1),
  ("cattoy", 2)
).toDF("Order_name", "Order_ID")
// Append a consecutive index column to df1
val df11 = spark.createDataFrame(
  df1.rdd.zipWithIndex.map {
    case (row, index) => Row.fromSeq(row.toSeq :+ index)
  },
  // Extend the original schema with the index column
  StructType(df1.schema.fields :+ StructField("index", LongType, false))
)

// Append a consecutive index column to df2
val df22 = spark.createDataFrame(
  df2.rdd.zipWithIndex.map {
    case (row, index) => Row.fromSeq(row.toSeq :+ index)
  },
  StructType(df2.schema.fields :+ StructField("index", LongType, false))
)
Now join the two dataframes on the index column and drop it to get the final dataframe:
df11.join(df22, Seq("index")).drop("index")
Output:
+-------------+--------------+------------+----------+--------+
|Customer_name|Customer_phone|Customer_age|Order_name|Order_ID|
+-------------+--------------+------------+----------+--------+
|karti |9685684551 |24 |watch |1 |
|raja |8595456552 |22 |cattoy |2 |
+-------------+--------------+------------+----------+--------+
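The join above effectively pairs rows by position. As a sanity check of that logic, here is a hypothetical local analogue in plain Scala (row values copied from the question): zip the two row lists by position, then concatenate the fields of each pair.

```scala
object IndexJoinSketch {
  def main(args: Array[String]): Unit = {
    // Rows of the two tables, in the order they should be paired.
    val customers = Seq(
      Seq("karti", "9685684551", "24"),
      Seq("raja", "8595456552", "22")
    )
    val orders = Seq(Seq("watch", "1"), Seq("cattoy", "2"))
    // zip pairs rows by position; ++ concatenates their fields,
    // mirroring the join on the generated index column.
    val joined = customers.zip(orders).map { case (c, o) => c ++ o }
    joined.foreach(row => println(row.mkString(" ")))
  }
}
```

Note that zip truncates to the shorter of the two lists, which matches the behavior of the inner join on index when the row counts differ.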