Apache Spark: how to insert data into a DataFrame column with empty values using Java


Question

I have to insert values available in DataFrame1 into a column of DataFrame2 that currently holds empty values. In other words, I need to update a column in DataFrame2.

Both DataFrames have two columns in common.

Is there a way to do this using Java? Or is there a different approach?

Sample input:

1) File1.csv

BILL_ID,BILL_NBR_TYPE_CD,BILL_NBR,VERSION,PRIM_SW
0501841898,BIN     ,404154,1000,Y
0681220958,BIN     ,735332,1000,Y
5992410180,BIN     ,454680,1000,Y
6995270884,SREBIN  ,1000252750295575,1000,Y

Here BILL_ID is the system ID and BILL_NBR is the external ID.

2) File2.csv

TXN_ID,TXN_TYPE,BILL_ID,BILL_NBR_TYPE_CD,BILL_NBR
01234, ABC     ,"     ",BIN     ,404154
22365, XYZ     ,"     ",BIN     ,735332
45890, LKJ     ,"     ",BIN     ,454680
23456, MPK     ,"     ",SREBIN  ,1000252750295575

Sample output

As shown below, the BILL_ID values should be populated in File2.csv:

01234, ABC     ,501841898,BIN     ,404154
22365, XYZ     ,681220958,BIN     ,735332
45890, LKJ     ,5992410180,BIN     ,454680
23456, MPK     ,6995270884,SREBIN  ,1000252750295575

I have created two DataFrames and loaded each file's data into them, but I am not sure how to proceed.

Edit

Basically, I want clarity on the following three steps:

  1. How to get the BILL_NBR and BILL_NBR_TYPE_CD values from File2.csv?

For this step I have written: file2Df.select("BILL_NBR_TYPE_CD","BILL_NBR");

  2. How to get the BILL_ID values from File1.csv based on the values retrieved in step 1?

  3. How to update the BILL_ID values accordingly in File2.csv?

I am new to Spark and would appreciate it if someone could give me some pointers.

Recommended answer

You need to join the two tables on the BILL_NBR column; the code below also matches on BILL_NBR_TYPE_CD, which both files share.

Assumption: there is a one-to-one relationship between the BILL_NBR and BILL_ID columns.

Assuming that the DataFrames for File1.csv and File2.csv are named file1DF and file2DF respectively, the following should work for you:

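// Keep only the columns needed from each side (bills, txns, and result are illustrative names).
Dataset<Row> bills = file1DF.select("BILL_ID", "BILL_NBR", "BILL_NBR_TYPE_CD");
Dataset<Row> txns = file2DF.select("TXN_ID", "TXN_TYPE", "BILL_NBR_TYPE_CD", "BILL_NBR");
// Inner join on both shared columns, then restore File2's column order with BILL_ID taken from File1.
Dataset<Row> result = txns.join(bills, txns.col("BILL_NBR").equalTo(bills.col("BILL_NBR")).and(txns.col("BILL_NBR_TYPE_CD").equalTo(bills.col("BILL_NBR_TYPE_CD"))))
    .select(txns.col("TXN_ID"), txns.col("TXN_TYPE"), bills.col("BILL_ID"), txns.col("BILL_NBR_TYPE_CD"), txns.col("BILL_NBR"));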

Note: I have not been able to run the code above to test it. Please let me know if you hit any compile-time or run-time errors.
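
To make the join and the one-to-one assumption concrete, here is a minimal plain-Java sketch (not Spark code) of what the lookup computes on the sample rows. The class name BillLookup, its method names, and the array layouts are hypothetical and chosen only for this illustration. It indexes File1 rows by the pair (BILL_NBR_TYPE_CD, BILL_NBR), fails loudly if one key maps to two different BILL_ID values, and then fills the empty BILL_ID column of the File2 rows.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration (not Spark) of what the join computes.
// BillLookup and its method names are hypothetical, chosen for this sketch.
public class BillLookup {

    // Composite join key: BILL_NBR_TYPE_CD plus BILL_NBR, trimmed so that
    // padded values such as "BIN     " compare equal to "BIN".
    public static String key(String typeCd, String billNbr) {
        return typeCd.trim() + "|" + billNbr.trim();
    }

    // Builds key -> BILL_ID from File1 rows laid out as
    // {BILL_ID, BILL_NBR_TYPE_CD, BILL_NBR}. Throws if one key maps to two
    // different BILL_IDs, i.e. if the one-to-one assumption does not hold.
    public static Map<String, String> index(List<String[]> file1Rows) {
        Map<String, String> byKey = new HashMap<>();
        for (String[] r : file1Rows) {
            String previous = byKey.putIfAbsent(key(r[1], r[2]), r[0]);
            if (previous != null && !previous.equals(r[0])) {
                throw new IllegalStateException("Ambiguous BILL_ID for " + key(r[1], r[2]));
            }
        }
        return byKey;
    }

    // Fills column 2 (BILL_ID) of File2 rows laid out as
    // {TXN_ID, TXN_TYPE, BILL_ID, BILL_NBR_TYPE_CD, BILL_NBR}.
    // Rows without a match keep their original BILL_ID value.
    public static List<String[]> fill(List<String[]> file2Rows, Map<String, String> byKey) {
        List<String[]> out = new ArrayList<>();
        for (String[] r : file2Rows) {
            String[] copy = r.clone();
            copy[2] = byKey.getOrDefault(key(r[3], r[4]), r[2]);
            out.add(copy);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> file1 = List.of(
            new String[] {"0501841898", "BIN     ", "404154"},
            new String[] {"6995270884", "SREBIN  ", "1000252750295575"});
        List<String[]> file2 = List.of(
            new String[] {"01234", " ABC     ", "     ", "BIN     ", "404154"},
            new String[] {"23456", " MPK     ", "     ", "SREBIN  ", "1000252750295575"});
        for (String[] row : fill(file2, index(file1))) {
            System.out.println(String.join(",", row));
        }
    }
}
```

One design difference is worth noting: this sketch leaves unmatched File2 rows unchanged, whereas a plain (inner) join drops them. If unmatched transactions must survive in Spark, a left join would be the corresponding choice.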
