Joining two dataframes without a common column


Question

我有两个具有不同类型列的数据框.我需要加入这两个不同的数据框.请参考以下示例

I have two dataframes which has different types of columns. I need to join those two different dataframe. Please refer the below example

val df1 has
Customer_name 
Customer_phone
Customer_age

val df2 has
Order_name
Order_ID

These two dataframes don't have any common column, and the number of rows and columns also differs between them. I tried to insert a new dummy column with an increasing row_index value, as below:

val dfr = df1.withColumn("row_index", monotonically_increasing_id())

But as I am using Spark 2, the monotonically_increasing_id method is not supported. Is there any way to join the two dataframes, so that I can put the values of both into a single sheet of an Excel file?

For example:

val df1:
Customer_name  Customer_phone  Customer_age
karti          9685684551      24
raja           8595456552      22

val df2:
Order_name  Order_ID
watch       1
cattoy      2

My final Excel sheet should look like this:

Customer_name  Customer_phone  Customer_age  Order_name  Order_ID
karti          9685684551      24            watch       1
raja           8595456552      22            cattoy      2

Answer

monotonically_increasing_id() is increasing and unique, but not consecutive.
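For illustration, here is a minimal sketch (the column name gen_id is made up for this example) of that behavior: the function encodes the partition ID in the upper bits of the value, so the IDs jump between partitions.

import org.apache.spark.sql.functions.monotonically_increasing_id

// Spread 4 rows over 2 partitions and attach generated IDs
val ids = spark.range(4).repartition(2)
  .withColumn("gen_id", monotonically_increasing_id())
ids.show()
// Typical output: gen_id values 0, 1, 8589934592, 8589934593 -
// unique and increasing, but not consecutive across partitions,
// so joining two dataframes on these IDs would not line the rows up.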

You can use zipWithIndex instead: convert each DataFrame to an RDD, zip it with a consecutive index, and rebuild each DataFrame with its original schema plus the index column.

import spark.implicits._
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StructField, StructType}


val df1 = Seq(
  ("karti", "9685684551", 24),
  ("raja", "8595456552", 22)
).toDF("Customer_name", "Customer_phone", "Customer_age")


val df2 = Seq(
  ("watch", 1),
  ("cattoy", 2)
).toDF("Order_name", "Order_ID")

val df11 = spark.createDataFrame(
  df1.rdd.zipWithIndex.map {
    case (row, index) => Row.fromSeq(row.toSeq :+ index)
  },
  // Original schema plus a non-nullable Long index column
  StructType(df1.schema.fields :+ StructField("index", LongType, false))
)


val df22 = spark.createDataFrame(
  df2.rdd.zipWithIndex.map {
    case (row, index) => Row.fromSeq(row.toSeq :+ index)
  },
  // Original schema plus a non-nullable Long index column
  StructType(df2.schema.fields :+ StructField("index", LongType, false))
)

Now join on the index column and drop it to get the final dataframe:

df11.join(df22, Seq("index")).drop("index")

Output:

+-------------+--------------+------------+----------+--------+
|Customer_name|Customer_phone|Customer_age|Order_name|Order_ID|
+-------------+--------------+------------+----------+--------+
|karti        |9685684551    |24          |watch     |1       |
|raja         |8595456552    |22          |cattoy    |2       |
+-------------+--------------+------------+----------+--------+
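The question also asks for the result in a single Excel sheet. Spark itself has no built-in Excel writer, so one workaround (the output path below is a hypothetical example) is to write the joined result as a single CSV file with a header row, which Excel opens directly:

// Write the joined result as one CSV part file that Excel can open.
// coalesce(1) forces a single output file; "output/customers_orders"
// is a hypothetical path - adjust it for your environment.
df11.join(df22, Seq("index"))
  .drop("index")
  .coalesce(1)
  .write
  .option("header", "true")
  .mode("overwrite")
  .csv("output/customers_orders")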

