Apache Spark: how to insert data into a column with empty values in a DataFrame using Java
Problem description
I need to insert values from DataFrame1 into a column of DataFrame2 that currently holds empty values. In other words, I need to update a column in DataFrame2.
Both DataFrames have two columns in common.
Is there a way to do this in Java, or is a different approach needed?
Sample input:
1) File1.csv
BILL_ID,BILL_NBR_TYPE_CD,BILL_NBR,VERSION,PRIM_SW
0501841898,BIN ,404154,1000,Y
0681220958,BIN ,735332,1000,Y
5992410180,BIN ,454680,1000,Y
6995270884,SREBIN ,1000252750295575,1000,Y
Here BILL_ID is the system ID and BILL_NBR is the external ID.
2) File2.csv
TXN_ID,TXN_TYPE,BILL_ID,BILL_NBR_TYPE_CD,BILL_NBR
01234, ABC ," ",BIN ,404154
22365, XYZ ," ",BIN ,735332
45890, LKJ ," ",BIN ,454680
23456, MPK ," ",SREBIN ,1000252750295575
Sample output
As shown below, the BILL_ID value should be populated in File2.csv:
01234, ABC ,501841898,BIN ,404154
22365, XYZ ,681220958,BIN ,735332
45890, LKJ ,5992410180,BIN ,454680
23456, MPK ,6995270884,SREBIN ,1000252750295575
I have created two DataFrames and loaded the data from both files into them, but I am not sure how to proceed.
Edit
Basically, I want clarity on the following three steps:
- How do I get the BILL_NBR and BILL_NBR_TYPE_CD values from File2.csv? For this step I have written: file2Df.select("BILL_NBR_TYPE_CD","BILL_NBR");
- How do I get the BILL_ID values from File1.csv based on the values retrieved in step 1?
- How do I update the BILL_ID values accordingly in File2.csv?
I am new to Spark and would appreciate any pointers.
Recommended answer
You need to join the two tables on the BILL_NBR column (the code below also matches on BILL_NBR_TYPE_CD, the other column the two files share).
Assumption: there is a one-to-one relationship between the BILL_NBR and BILL_ID columns.
Assuming the DataFrames for File1.csv and File2.csv are named file1DF and file2DF respectively, the following should work for you:
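// BILL_ID lookup from File1, keyed by the two shared columns.
Dataset<Row> ids = file1DF.select("BILL_ID", "BILL_NBR", "BILL_NBR_TYPE_CD");
Dataset<Row> txns = file2DF.select("TXN_ID", "TXN_TYPE", "BILL_NBR_TYPE_CD", "BILL_NBR"); // leaves out the empty BILL_ID column
// Inner join on both shared columns, then restore File2's column order.
Dataset<Row> result = txns.join(ids, txns.col("BILL_NBR").equalTo(ids.col("BILL_NBR")).and(txns.col("BILL_NBR_TYPE_CD").equalTo(ids.col("BILL_NBR_TYPE_CD"))))
    .select(txns.col("TXN_ID"), txns.col("TXN_TYPE"), ids.col("BILL_ID"), txns.col("BILL_NBR_TYPE_CD"), txns.col("BILL_NBR"));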
Note: I have not been able to test the code above by running it. Please let me know if you hit any compile-time or run-time errors.
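If it helps to see what the join is doing row by row, here is a minimal plain-Java sketch of the same lookup, with no Spark involved. The class name BillIdJoinSketch and the method fillBillIds are made up for illustration, and rows are assumed to be String arrays in the column order of the two sample files. Unlike the inner join above, this sketch keeps File2 rows that have no match and leaves their BILL_ID untouched, which is closer to a left join.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: illustrates the join logic only, not the Spark API.
class BillIdJoinSketch {

    // file1Rows: {BILL_ID, BILL_NBR_TYPE_CD, BILL_NBR, VERSION, PRIM_SW}
    // file2Rows: {TXN_ID, TXN_TYPE, BILL_ID, BILL_NBR_TYPE_CD, BILL_NBR}
    // Returns copies of the File2 rows with BILL_ID filled in where
    // (BILL_NBR_TYPE_CD, BILL_NBR) matches a File1 row exactly.
    static List<String[]> fillBillIds(List<String[]> file1Rows, List<String[]> file2Rows) {
        // Index File1 by the composite key made of the two shared columns.
        // With a one-to-one relationship each key maps to a single BILL_ID.
        Map<List<String>, String> billIdByKey = new HashMap<>();
        for (String[] row : file1Rows) {
            billIdByKey.put(Arrays.asList(row[1], row[2]), row[0]);
        }

        List<String[]> result = new ArrayList<>();
        for (String[] row : file2Rows) {
            String[] copy = row.clone(); // do not modify the input rows
            String billId = billIdByKey.get(Arrays.asList(row[3], row[4]));
            if (billId != null) {
                copy[2] = billId; // fill in the empty BILL_ID
            }
            result.add(copy);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String[]> file1 = new ArrayList<>();
        file1.add(new String[] {"0501841898", "BIN ", "404154", "1000", "Y"});
        file1.add(new String[] {"6995270884", "SREBIN ", "1000252750295575", "1000", "Y"});

        List<String[]> file2 = new ArrayList<>();
        file2.add(new String[] {"01234", " ABC ", " ", "BIN ", "404154"});
        file2.add(new String[] {"23456", " MPK ", " ", "SREBIN ", "1000252750295575"});

        for (String[] row : fillBillIds(file1, file2)) {
            System.out.println(String.join(",", row));
        }
    }
}
```

The key comparison is exact, so values such as "BIN " with a trailing space only match if both files carry the same padding. The same holds for an equality join in Spark, so trimming the key columns before joining may be worth considering if the padding is inconsistent.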