Join two DataFrames without a common column (Spark, Scala)


Problem description

I have two DataFrames with different sets of columns, and I need to join them. Please refer to the example below:

val df1 has
Customer_name 
Customer_phone
Customer_age

val df2 has
Order_name
Order_ID

These two DataFrames don't share any common column, and they also differ in both the number of rows and the number of columns. I tried to insert a dummy column holding a row index, like this: val dfr = df1.withColumn("row_index", monotonically_increasing_id()).

But since I am using Spark 2, the monotonically_increasing_id approach does not work for my purpose. Is there any way to join the two DataFrames, so that I can write the values of both into a single sheet of an Excel file?

For example:

val df1:
Customer_name  Customer_phone  Customer_age
karti           9685684551     24      
raja            8595456552     22

val df2:
Order_name  Order_ID
watch       1
cattoy      2

My final Excel sheet should look like this:

Customer_name  Customer_phone  Customer_age  Order_name  Order_ID
karti          9685684551      24            watch       1
raja           8595456552      22            cattoy      2

Answer

monotonically_increasing_id() is increasing and unique, but not consecutive.
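To see why, note that Spark's API docs describe the current implementation as putting the partition ID in the upper 31 bits and the per-partition record number in the lower 33 bits, so the IDs jump between partitions. A minimal plain-Scala sketch of that layout (the two-partition row counts are a made-up illustration, not output from Spark):

```scala
// Sketch of how monotonically_increasing_id() composes its values:
// upper 31 bits = partition ID, lower 33 bits = record number in that partition.
object IdSketch {
  def monotonicId(partitionId: Int, recordIndex: Long): Long =
    (partitionId.toLong << 33) + recordIndex

  def main(args: Array[String]): Unit = {
    // Two rows in partition 0 and two rows in partition 1:
    val ids = Seq(
      monotonicId(0, 0), monotonicId(0, 1),
      monotonicId(1, 0), monotonicId(1, 1)
    )
    // Increasing and unique, but with a huge gap between partitions:
    println(ids)  // List(0, 1, 8589934592, 8589934593)
  }
}
```

Because the IDs of the two DataFrames are generated independently and are not consecutive, they cannot be used as a reliable join key.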

You can use zipWithIndex instead: convert each DataFrame to an RDD, zip it with a consecutive index, and rebuild a DataFrame with the original schema plus the index column.

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StructField, StructType}
import spark.implicits._


val df1 = Seq(
  ("karti", "9685684551", 24),
  ("raja", "8595456552", 22)
).toDF("Customer_name", "Customer_phone", "Customer_age")


val df2 = Seq(
  ("watch", 1),
  ("cattoy", 2)
).toDF("Order_name", "Order_ID")

val df11 = spark.createDataFrame(
  df1.rdd.zipWithIndex.map {
    case (row, index) => Row.fromSeq(row.toSeq :+ index)
  },
  // Original schema plus a non-nullable Long index column
  StructType(df1.schema.fields :+ StructField("index", LongType, false))
)


val df22 = spark.createDataFrame(
  df2.rdd.zipWithIndex.map {
    case (row, index) => Row.fromSeq(row.toSeq :+ index)
  },
  // Original schema plus a non-nullable Long index column
  StructType(df2.schema.fields :+ StructField("index", LongType, false))
)

Now join the two DataFrames on the index column and drop it:

df11.join(df22, Seq("index")).drop("index")

Output:

+-------------+--------------+------------+----------+--------+
|Customer_name|Customer_phone|Customer_age|Order_name|Order_ID|
+-------------+--------------+------------+----------+--------+
|karti        |9685684551    |24          |watch     |1       |
|raja         |8595456552    |22          |cattoy    |2       |
+-------------+--------------+------------+----------+--------+
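The core of the trick can be illustrated without Spark: zipWithIndex assigns consecutive 0-based indices, and joining on that index stitches the rows together positionally. A plain-Scala sketch using the sample rows from the question:

```scala
// Positional "join" of two row sequences via zipWithIndex.
object ZipJoinSketch {
  def main(args: Array[String]): Unit = {
    val customers = Seq(("karti", "9685684551", 24), ("raja", "8595456552", 22))
    val orders    = Seq(("watch", 1), ("cattoy", 2))

    // zipWithIndex pairs each row with a consecutive index 0, 1, 2, ...
    val ordersByIndex = orders.zipWithIndex.map { case (row, i) => i -> row }.toMap

    // Pair each customer with the order that has the same index.
    val joined = customers.zipWithIndex.collect {
      case (c, i) if ordersByIndex.contains(i) => (c, ordersByIndex(i))
    }
    joined.foreach(println)
  }
}
```

Using a Map for one side mirrors what the index equi-join does; rows without a matching index are simply dropped, as in the inner join above.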
