Add a column from one DataFrame to another DataFrame in Scala


Problem description

I have two DataFrames with the same number of rows, but the number of columns differs and varies dynamically depending on the source.

The first DataFrame contains all columns, but the second DataFrame has been filtered and processed and no longer contains all of them.

I need to pick a specific column from the first DataFrame and add/merge it into the second DataFrame.

val sourceDf = spark.read.load(parquetFilePath)
val resultDf = spark.read.load(resultFilePath)

val columnName: String = "Col1"

I tried to add the column in several ways; here are a few of them:

val modifiedResult = resultDf.withColumn(columnName, sourceDf.col(columnName))

val modifiedResult = resultDf.withColumn(columnName, sourceDf(columnName))
val modifiedResult = resultDf.withColumn(columnName, labelColumnUdf(sourceDf.col(columnName)))

None of these work.

Can you please help me merge/add a column from the first DataFrame into the second DataFrame?

The example below is not the exact data structure I need, but it is enough to illustrate the problem.

Sample input and output:

Source DataFrame:
+--------+
|InputGas|
+--------+
|    1000|
|    2000|
|    3000|
|    4000|
+--------+

Result DataFrame:
+----+-------+-----+
|Time|CalcGas|Speed|
+----+-------+-----+
|   0|    111| 1111|
|   0|    222| 2222|
|   1|    333| 3333|
|   2|    444| 4444|
+----+-------+-----+

Expected Output:
+----+-------+-----+--------+
|Time|CalcGas|Speed|InputGas|
+----+-------+-----+--------+
|   0|    111| 1111|    1000|
|   0|    222| 2222|    2000|
|   1|    333| 3333|    3000|
|   2|    444| 4444|    4000|
+----+-------+-----+--------+


Recommended answer

Use join

If both DataFrames share a common column, you can join on that column to get the desired result.

Example:

import sparkSession.sqlContext.implicits._  // `sparkSession` is your SparkSession instance

val df1 = Seq((1, "Anu"),(2, "Suresh"),(3, "Usha"), (4, "Nisha")).toDF("id","name")
val df2 = Seq((1, 23),(2, 24),(3, 24), (4, 25), (5, 30), (6, 32)).toDF("id","age")

// Inner join on the shared "id" column; ids 5 and 6 have no match in df1 and are dropped.
val df = df1.as("df1").join(df2.as("df2"), df1("id") === df2("id")).select("df1.id", "df1.name", "df2.age")
df.show()

Output:

+---+------+---+
| id|  name|age|
+---+------+---+
|  1|   Anu| 23|
|  2|Suresh| 24|
|  3|  Usha| 24|
|  4| Nisha| 25|
+---+------+---+
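
As a side note, the same inner join can be written more compactly by passing the key column name as a sequence. This is only a sketch of an alternative spelling, not part of the original answer; the session setup (`local[*]` master, app name) is an assumption made so the snippet runs on its own.

```scala
import org.apache.spark.sql.SparkSession

// Assumed local session, only so the sketch is self-contained.
val spark = SparkSession.builder().master("local[*]").appName("join-using-sketch").getOrCreate()
import spark.implicits._

val df1 = Seq((1, "Anu"), (2, "Suresh"), (3, "Usha"), (4, "Nisha")).toDF("id", "name")
val df2 = Seq((1, 23), (2, 24), (3, 24), (4, 25), (5, 30), (6, 32)).toDF("id", "age")

// Passing the key as Seq("id") performs an inner equi-join and keeps a single "id" column,
// so no aliases or explicit select are needed to avoid a duplicated key.
val joined = df1.join(df2, Seq("id"))
joined.orderBy("id").show()
```

With this form the result has the columns id, name and age, and there is no ambiguous second "id" column to disambiguate.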






Update:

If the two DataFrames have no unique id in common, create one in each and join on it.

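import sparkSession.sqlContext.implicits._  // `sparkSession` is your SparkSession instance
import org.apache.spark.sql.functions._

var sourceDf = Seq(1000, 2000, 3000, 4000).toDF("InputGas")
var resultDf  = Seq((0, 111, 1111), (0, 222, 2222), (1, 333, 3333), (2, 444, 4444)).toDF("Time", "CalcGas", "Speed")

// Tag every row with a generated id. Note: monotonically_increasing_id() is guaranteed to be
// unique and increasing, but not consecutive, so the ids of the two DataFrames only line up
// when both have the same partitioning and row order.
sourceDf = sourceDf.withColumn("rowId1", monotonically_increasing_id())
resultDf = resultDf.withColumn("rowId2", monotonically_increasing_id())

// Inner join on the generated ids, then keep only the wanted columns.
val df = sourceDf.as("df1").join(resultDf.as("df2"), sourceDf("rowId1") === resultDf("rowId2"), "inner").select("df1.InputGas", "df2.Time", "df2.CalcGas", "df2.Speed")
df.show()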

Output:

+--------+----+-------+-----+
|InputGas|Time|CalcGas|Speed|
+--------+----+-------+-----+
|    1000|   0|    111| 1111|
|    2000|   0|    222| 2222|
|    3000|   1|    333| 3333|
|    4000|   2|    444| 4444|
+--------+----+-------+-----+
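
The ids produced by `monotonically_increasing_id()` depend on how each DataFrame is partitioned, so they may not match row for row when the two DataFrames come from different files. One possible alternative, sketched here under assumptions and not taken from the original answer, is to attach a consecutive index with `RDD.zipWithIndex` and join on that. The helper name `withRowIndex`, the index column name `rowId` and the local session setup are all hypothetical, and the sketch assumes that the row order of both DataFrames is meaningful and corresponds one to one.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Assumed local session, only so the sketch is self-contained.
val spark = SparkSession.builder().master("local[*]").appName("add-column-sketch").getOrCreate()
import spark.implicits._

// Hypothetical helper: appends a consecutive, 0-based index column to a DataFrame.
def withRowIndex(df: DataFrame, indexCol: String): DataFrame = {
  // zipWithIndex numbers rows by partition order, then by position within each partition.
  val indexedRows = df.rdd.zipWithIndex().map { case (row, idx) => Row.fromSeq(row.toSeq :+ idx) }
  val schema = StructType(df.schema.fields :+ StructField(indexCol, LongType, nullable = false))
  df.sparkSession.createDataFrame(indexedRows, schema)
}

val sourceDf = Seq(1000, 2000, 3000, 4000).toDF("InputGas")
val resultDf = Seq((0, 111, 1111), (0, 222, 2222), (1, 333, 3333), (2, 444, 4444)).toDF("Time", "CalcGas", "Speed")

// The column to copy over; chosen dynamically in the question's scenario.
val columnName = "InputGas"

// Index both sides, join on the index, restore the original order, and drop the helper column.
val merged = withRowIndex(resultDf, "rowId")
  .join(withRowIndex(sourceDf.select(columnName), "rowId"), Seq("rowId"))
  .orderBy("rowId")
  .drop("rowId")

merged.show()
```

Because the result DataFrame is on the left side of the join, the copied column ends up last, which is the column order asked for in the question's expected output.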
