Apache Spark: how to insert data into a column with empty values in a DataFrame using Java


Question

I need to fill a column in DataFrame2 that currently holds empty values with the values available in DataFrame1. In effect, I want to update that column in DataFrame2.

Both DataFrames have two columns in common.

Is there a way to do this using Java, or is there a different approach?

Sample input:

1) File1.csv

BILL_ID,BILL_NBR_TYPE_CD,BILL_NBR,VERSION,PRIM_SW
0501841898,BIN     ,404154,1000,Y
0681220958,BIN     ,735332,1000,Y
5992410180,BIN     ,454680,1000,Y
6995270884,SREBIN  ,1000252750295575,1000,Y

Here BILL_ID is the system ID and BILL_NBR is the external ID.

2) File2.csv


TXN_ID,TXN_TYPE,BILL_ID,BILL_NBR_TYPE_CD,BILL_NBR
01234, ABC     ,"     ",BIN     ,404154
22365, XYZ     ,"     ",BIN     ,735332
45890, LKJ     ,"     ",BIN     ,454680
23456, MPK     ,"     ",SREBIN  ,1000252750295575

Sample output

The BILL_ID values should be populated in File2.csv as shown below:

01234, ABC     ,501841898,BIN     ,404154
22365, XYZ     ,681220958,BIN     ,735332
45890, LKJ     ,5992410180,BIN     ,454680
23456, MPK     ,6995270884,SREBIN  ,1000252750295575

I have created two DataFrames and loaded the data from both files into them, but I am not sure how to proceed.

Edit

Basically, I want clarity on the following three steps:

  1. How do I get the BILL_NBR and BILL_NBR_TYPE_CD values from File2.csv?

     For this step I wrote: file2Df.select("BILL_NBR_TYPE_CD","BILL_NBR");

  2. How do I get the BILL_ID values from File1.csv based on the values retrieved in step 1?

  3. How do I update the BILL_ID values accordingly in File2.csv?

I am new to Spark and would appreciate any pointers.

Recommended answer

You need to join the two tables on the BILL_NBR column.

Assumption: there is a one-to-one relationship between the BILL_NBR and BILL_ID columns.
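If you are unsure whether that assumption holds for your data, it can be checked before joining. The following is a minimal plain-Java sketch (no Spark involved) that reports every BILL_NBR that maps to more than one BILL_ID, which is the direction that would cause the join to duplicate rows. The class name, method name, and the three-column row layout are hypothetical and chosen only for this illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BillKeyCheck {
    // Returns the BILL_NBR values that appear with more than one distinct BILL_ID.
    // Each row is {BILL_ID, BILL_NBR_TYPE_CD, BILL_NBR}; this layout is an assumption of the sketch.
    static List<String> ambiguousBillNbrs(List<String[]> file1Rows) {
        Map<String, String> firstIdSeen = new HashMap<>();
        List<String> ambiguous = new ArrayList<>();
        for (String[] row : file1Rows) {
            String billId = row[0];
            String billNbr = row[2];
            // putIfAbsent returns the BILL_ID recorded earlier for this BILL_NBR, or null if none.
            String previous = firstIdSeen.putIfAbsent(billNbr, billId);
            if (previous != null && !previous.equals(billId) && !ambiguous.contains(billNbr)) {
                ambiguous.add(billNbr);
            }
        }
        return ambiguous;
    }

    public static void main(String[] args) {
        List<String[]> rows = new ArrayList<>();
        rows.add(new String[] {"0501841898", "BIN     ", "404154"});
        rows.add(new String[] {"0681220958", "BIN     ", "735332"});
        // An empty result means no BILL_NBR is ambiguous in this sample.
        System.out.println(ambiguousBillNbrs(rows));
    }
}
```

If the returned list is not empty, a join on BILL_NBR would produce more than one output row for the affected transactions, so the data would need deduplicating first.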

Assuming that the DataFrames for File1.csv and File2.csv are named file1DF and file2DF respectively, the following should work for you:

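// Keep only the needed columns; this drops the empty BILL_ID from file2DF.
Dataset<Row> file1Keys = file1DF.select("BILL_ID", "BILL_NBR", "BILL_NBR_TYPE_CD");
Dataset<Row> file2Rest = file2DF.select("TXN_ID", "TXN_TYPE", "BILL_NBR_TYPE_CD", "BILL_NBR");
// Inner join on both common columns; new variable names avoid redeclaring file1DF and file2DF.
Dataset<Row> joined = file2Rest.join(file1Keys, file2Rest.col("BILL_NBR").equalTo(file1Keys.col("BILL_NBR"))
        .and(file2Rest.col("BILL_NBR_TYPE_CD").equalTo(file1Keys.col("BILL_NBR_TYPE_CD"))));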

Note: I have not been able to run the code above to test it. Please let me know if you hit any compile-time or runtime errors.
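To see what the join is expected to produce without running Spark, here is a minimal plain-Java sketch of the same lookup: it indexes the File1.csv rows by the two common columns and then fills BILL_ID in each File2.csv row. The class and method names are hypothetical, the rows are assumed to be already split into fields in the column order shown in the question, and keys are compared exactly (no trimming), as an equality join would do.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FillBillId {
    // Builds the lookup key from the two common columns.
    static String key(String typeCd, String billNbr) {
        return typeCd + "|" + billNbr;
    }

    // file1Rows: {BILL_ID, BILL_NBR_TYPE_CD, BILL_NBR, VERSION, PRIM_SW}
    // file2Rows: {TXN_ID, TXN_TYPE, BILL_ID, BILL_NBR_TYPE_CD, BILL_NBR}
    static List<String[]> fill(List<String[]> file1Rows, List<String[]> file2Rows) {
        // Steps 1 and 2: index BILL_ID by (BILL_NBR_TYPE_CD, BILL_NBR).
        Map<String, String> billIdByKey = new HashMap<>();
        for (String[] r : file1Rows) {
            billIdByKey.put(key(r[1], r[2]), r[0]);
        }
        // Step 3: copy each File2 row with BILL_ID filled in.
        List<String[]> out = new ArrayList<>();
        for (String[] r : file2Rows) {
            String billId = billIdByKey.get(key(r[3], r[4]));
            if (billId == null) {
                continue; // like an inner join, rows without a match are dropped
            }
            String[] copy = r.clone();
            copy[2] = billId; // BILL_ID is kept as text, so leading zeros survive
            out.add(copy);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> file1 = new ArrayList<>();
        file1.add(new String[] {"0501841898", "BIN     ", "404154", "1000", "Y"});
        List<String[]> file2 = new ArrayList<>();
        file2.add(new String[] {"01234", " ABC     ", "     ", "BIN     ", "404154"});
        for (String[] row : fill(file1, file2)) {
            System.out.println(String.join(",", row));
        }
    }
}
```

In Spark the join does this matching for you; the sketch is only meant to show which rows pair up and where BILL_ID ends up in the File2.csv layout.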
