Join in a DataFrame with Spark and Java
Question
First of all, thank you for taking the time to read my question.
My question is the following: in Spark with Java, I load the data from two CSV files into two dataframes.
These dataframes contain the following information.
Dataframe Airport
Id | Name | City
-----------------------
1 | Barajas | Madrid
Dataframe airport_city_state
City | state
----------------
Madrid | España
I want to join these two dataframes so that the result looks like this:
Dataframe result
Id | Name | City | state
--------------------------
1 | Barajas | Madrid | España
where dfairport.city = dfairport_city_state.city
But I cannot work out the syntax needed to do the join correctly. Here is some of the code showing how I created the variables:
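To make the expected result concrete, here is a minimal plain-Java sketch of the same inner join on the city, without Spark. The class name `InnerJoinSketch`, the method `joinOnCity`, and the representation of rows as string arrays are hypothetical, chosen only for illustration. It also assumes that each city maps to exactly one state, which holds for the sample data above but not for joins in general.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class InnerJoinSketch {

    // Joins airport rows {Id, Name, City} with a City -> state lookup.
    // Airports whose city has no entry in the lookup are dropped,
    // which is the behaviour of an inner join.
    static List<String[]> joinOnCity(List<String[]> airports, Map<String, String> stateByCity) {
        List<String[]> result = new ArrayList<>();
        for (String[] airport : airports) {
            String city = airport[2];
            String state = stateByCity.get(city);
            if (state != null) {
                result.add(new String[] {airport[0], airport[1], city, state});
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Sample rows taken from the two tables in the question
        List<String[]> airports = new ArrayList<>();
        airports.add(new String[] {"1", "Barajas", "Madrid"});

        Map<String, String> stateByCity = new HashMap<>();
        stateByCity.put("Madrid", "España");

        for (String[] row : joinOnCity(airports, stateByCity)) {
            System.out.println(String.join(" | ", row));
        }
    }
}
```

Spark performs the equivalent matching on distributed data; the sketch only shows which rows and columns the join is expected to produce.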
// Load the CSV files; the loader has to be told that there is a header row and which delimiter is used
Dataset<Row> dfairport = Load.Csv(sqlContext, data_airport);
Dataset<Row> dfairport_city_state = Load.Csv(sqlContext, data_airport_city_state);
// Rename the columns of the CSV dataframes to match the columns in the database
// Once the names match, the rows can be inserted
// Note: withColumnRenamed returns a new Dataset. The results are not assigned here,
// so dfairport and dfairport_city_state keep their original column names.
dfairport
    .withColumnRenamed("leg_key", "id")
    .withColumnRenamed("leg_name", "name")
    .withColumnRenamed("leg_city", "city");
dfairport_city_state
    .withColumnRenamed("city", "ciudad")
    .withColumnRenamed("state", "estado");
Recommended answer
First, thank you very much for your response.
I have tried both solutions, but neither of them works. I get the following error:
The method dfairport_city_state(String) is undefined for the type ETL_Airport
I cannot access a specific column of the dataframe for the join.
Edit:
I managed to get the join working. I am posting the solution here in case it helps someone else ;)
Thanks for everything and best regards
// Join the two tables on the city they share
Dataset<Row> joined = dfairport.join(dfairport_city_state, dfairport.col("leg_city").equalTo(dfairport_city_state.col("city")));
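The error quoted above is probably what the Java compiler reports when a column is referenced Scala-style as dataframe("column"). In Java the column has to be obtained with col("column"), as in the solution. The following is a minimal, self-contained sketch of the same join under a few assumptions: the class and helper names are hypothetical, small in-memory dataframes stand in for the two CSV files, and the column names are the ones from the tables in the question (City, state) rather than leg_city. Because a join on a column expression keeps the key column from both sides, the sketch drops the right-hand copy to get the four-column result.

```java
import java.util.Arrays;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class AirportJoinSketch {

    // Inner join on City, producing Id | Name | City | state
    static Dataset<Row> joinAirports(Dataset<Row> airport, Dataset<Row> cityState) {
        return airport
            // In Java a column is referenced with col("..."), not dataframe("...")
            .join(cityState, airport.col("City").equalTo(cityState.col("City")), "inner")
            // Both inputs carry a City column; drop the copy from the right side
            .drop(cityState.col("City"));
    }

    // In-memory stand-in for the airport CSV file (assumption)
    static Dataset<Row> sampleAirport(SparkSession spark) {
        StructType schema = new StructType()
            .add("Id", DataTypes.IntegerType)
            .add("Name", DataTypes.StringType)
            .add("City", DataTypes.StringType);
        return spark.createDataFrame(
            Arrays.asList(RowFactory.create(1, "Barajas", "Madrid")), schema);
    }

    // In-memory stand-in for the airport_city_state CSV file (assumption)
    static Dataset<Row> sampleCityState(SparkSession spark) {
        StructType schema = new StructType()
            .add("City", DataTypes.StringType)
            .add("state", DataTypes.StringType);
        return spark.createDataFrame(
            Arrays.asList(RowFactory.create("Madrid", "España")), schema);
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("airport-join-sketch")
            .master("local[*]")
            .getOrCreate();

        Dataset<Row> joined = joinAirports(sampleAirport(spark), sampleCityState(spark));
        joined.show();

        spark.stop();
    }
}
```

When the key column has the same name in both dataframes, the overload join(right, "City") is an alternative that keeps a single City column, so no drop is needed.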