Apache Spark: how to insert data into a column with empty values in a DataFrame using Java
Question
I need to fill a column in DataFrame2 that currently holds empty values, using the values available in DataFrame1. Basically, I want to update a column in DataFrame2.
Both DataFrames have two columns in common.
Is there a way to do this using Java? Or is there a different approach?
Sample input:
1) File1.csv
BILL_ID,BILL_NBR_TYPE_CD,BILL_NBR,VERSION,PRIM_SW
0501841898,BIN ,404154,1000,Y
0681220958,BIN ,735332,1000,Y
5992410180,BIN ,454680,1000,Y
6995270884,SREBIN ,1000252750295575,1000,Y
Here BILL_ID is the system ID and BILL_NBR is the external ID.
2) File2.csv
TXN_ID,TXN_TYPE,BILL_ID,BILL_NBR_TYPE_CD,BILL_NBR
01234, ABC ," ",BIN ,404154
22365, XYZ ," ",BIN ,735332
45890, LKJ ," ",BIN ,454680
23456, MPK ," ",SREBIN ,1000252750295575
Sample output
As shown below, the BILL_ID value should be populated in File2.csv:
01234, ABC ,501841898,BIN ,404154
22365, XYZ ,681220958,BIN ,735332
45890, LKJ ,5992410180,BIN ,454680
23456, MPK ,6995270884,SREBIN ,1000252750295575
I have created two DataFrames and loaded the data from both files into them, but I am not sure how to proceed.
Edit
Basically, I want clarity on the following three steps:
1) How to get the BILL_NBR and BILL_NBR_TYPE_CD values from File2.csv? For this step I wrote: file2Df.select("BILL_NBR_TYPE_CD","BILL_NBR");
2) How to get the BILL_ID values from File1.csv based on the values retrieved in step 1?
3) How to update the BILL_ID values accordingly in File2.csv?
I am new to Spark and would appreciate any pointers.
Recommended answer
You need to join the two tables on the BILL_NBR column.
Assumption: there is a one-to-one relationship between the BILL_NBR and BILL_ID columns.
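To make the one-to-one assumption concrete, here is a rough plain-Java sketch (deliberately not Spark) of what the join computes: a lookup from the pair (BILL_NBR_TYPE_CD, BILL_NBR) to BILL_ID. The class and method names (BillIdLookup, keyOf, buildIndex) are invented for illustration. If the same key appeared twice in File1.csv, a join would emit one output row per match, so the sketch rejects duplicates instead.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Spark: mimics the join as an in-memory lookup.
public class BillIdLookup {

    // Combine the two key columns; values are compared exactly, padding included ("BIN ").
    static String keyOf(String typeCd, String billNbr) {
        return typeCd + "|" + billNbr;
    }

    // Each row is {BILL_ID, BILL_NBR_TYPE_CD, BILL_NBR}, as selected from File1.csv.
    static Map<String, String> buildIndex(List<String[]> file1Rows) {
        Map<String, String> index = new HashMap<>();
        for (String[] row : file1Rows) {
            String previous = index.put(keyOf(row[1], row[2]), row[0]);
            if (previous != null) {
                // Violates the one-to-one assumption: a join would duplicate File2 rows here.
                throw new IllegalStateException("Duplicate key for BILL_NBR " + row[2]);
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, String> index = buildIndex(Arrays.asList(
                new String[] {"0501841898", "BIN ", "404154"},
                new String[] {"6995270884", "SREBIN ", "1000252750295575"}));
        // Look up the BILL_ID for a File2 row by its two key columns.
        System.out.println(index.get(keyOf("BIN ", "404154"))); // prints 0501841898
    }
}
```

A key that has no entry in the index simply returns null, which corresponds to a File2 row that an inner join would drop.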
Assuming that your DataFrames for File1.csv and File2.csv are named file1DF and file2DF respectively, the following should work for you:
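import org.apache.spark.sql.*;  // Dataset, Row
// Keep BILL_ID and the join keys from File1; drop the empty BILL_ID column from File2.
Dataset<Row> f1 = file1DF.select("BILL_ID", "BILL_NBR", "BILL_NBR_TYPE_CD");
Dataset<Row> f2 = file2DF.select("TXN_ID", "TXN_TYPE", "BILL_NBR_TYPE_CD", "BILL_NBR");
Dataset<Row> result = f2.join(f1, f2.col("BILL_NBR").equalTo(f1.col("BILL_NBR")).and(f2.col("BILL_NBR_TYPE_CD").equalTo(f1.col("BILL_NBR_TYPE_CD"))))
    .select(f2.col("TXN_ID"), f2.col("TXN_TYPE"), f1.col("BILL_ID"), f2.col("BILL_NBR_TYPE_CD"), f2.col("BILL_NBR"));  // File2 column order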
Note: I did not have the resources to run and test the code above. Please let me know if you hit any compile-time or run-time errors.
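For completeness, the three steps from the question could be put together roughly as follows. This is a sketch under assumptions, not tested against the asker's setup: the class name FillBillId, the method fillBillId, the local master, the relative input paths, and the output directory File2_filled are all placeholders. It uses a left join rather than an inner join, so File2 rows without a match in File1 are kept with a null BILL_ID.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Hypothetical driver class; names and paths are placeholders.
public class FillBillId {

    // Returns File2's rows with BILL_ID taken from File1, matched on both key columns.
    static Dataset<Row> fillBillId(Dataset<Row> file1DF, Dataset<Row> file2DF) {
        Dataset<Row> f1 = file1DF.select("BILL_ID", "BILL_NBR_TYPE_CD", "BILL_NBR");
        // Step 1: take everything from File2 except its empty BILL_ID column.
        Dataset<Row> f2 = file2DF.select("TXN_ID", "TXN_TYPE", "BILL_NBR_TYPE_CD", "BILL_NBR");
        // Step 2: a left join keeps File2 rows with no match (their BILL_ID becomes null).
        return f2.join(f1,
                        f2.col("BILL_NBR").equalTo(f1.col("BILL_NBR"))
                                .and(f2.col("BILL_NBR_TYPE_CD").equalTo(f1.col("BILL_NBR_TYPE_CD"))),
                        "left")
                // Step 3: put BILL_ID back in File2's original column position.
                .select(f2.col("TXN_ID"), f2.col("TXN_TYPE"), f1.col("BILL_ID"),
                        f2.col("BILL_NBR_TYPE_CD"), f2.col("BILL_NBR"));
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("FillBillId")
                .master("local[*]") // assumption: a local run for experimentation
                .getOrCreate();

        // With header=true and no schema inference, every column is read as a string,
        // so values such as 0501841898 are kept verbatim.
        Dataset<Row> file1DF = spark.read().option("header", "true").csv("File1.csv");
        Dataset<Row> file2DF = spark.read().option("header", "true").csv("File2.csv");

        Dataset<Row> result = fillBillId(file1DF, file2DF);
        result.show(false);

        // DataFrames are immutable, so the "update" is written out as a new dataset.
        result.write().option("header", "true").mode("overwrite").csv("File2_filled");
        spark.stop();
    }
}
```

Because the comparison is an exact string match, the trailing spaces in values like "BIN " only work here because both files pad them the same way; trimming both sides before the join would be a more defensive choice.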