How to find an optimized join between 2 different dataframes in Spark
Problem description
I have two different datasets that I would like to join, but there is no easy way to do it: they have no common column, and a crossJoin is not a good solution when working with big data. I already asked about this on Stack Overflow, but I really couldn't find an optimized way to join them. My question there is: Looking if String contain a sub-string in differents Dataframes
I saw the solutions below, but I didn't find a good fit for my case: Efficient string suffix detection and Efficient string matching in Apache Spark.
Today, I found an interesting solution :) I'm not sure whether it will work, but let's try.
I add a new column to df_1 that contains the line number of each row.
Example df_1:

name   | id
-------|-------
abc    | 1232
azerty | 87564
google | 374856
New df_1:

name     | id     | new_id
---------|--------|-------
abc      | 1232   | 1
azerty   | 87564  | 2
google   | 374856 | 3
explorer | 84763  | 4
The same for df_2.

Example df_2:

adress
------
UK
USA
EUROPE
New df_2:

adress | new_id
-------|-------
UK     | 1
USA    | 2
EUROPE | 3
Now that I have a common column between the two dataframes, I can do a left join using new_id as the key.

My questions: is this solution efficient? And how can I add a new_id column with the line numbering to each dataframe?
Recommended answer
Since Spark uses lazy evaluation, execution does not start until an action is triggered. So what you can do is simply call the createDataFrame function on the Spark session and pass it the selected columns from df1 and df2. It will create the new dataframe you need.
For example: df3 = spark.createDataFrame([df1.select(''), df2.select('')])
If this works for you, please upvote.