How to find an optimized join between 2 different dataframes in Spark
Problem description
I have two different datasets that I would like to join, but there is no easy way to do it: they have no common column, and a crossJoin is not a good solution when working with big data. I already asked about this on Stack Overflow, but I really couldn't find an optimized way to join them. My question there is: Looking if String contain a sub-string in differents Dataframes
I saw the solutions below, but I didn't find a good fit for my case: Efficient string suffix detection and Efficient string matching in Apache Spark.
Today, I found an interesting solution :) I'm not sure whether it will work, but let's try.
I add a new column to df_1 that contains the line number of each row.
Example df_1:

name   | id
-------|-------
abc    | 1232
azerty | 87564
google | 374856
New df_1:

name     | id     | new_id
---------|--------|-------
abc      | 1232   | 1
azerty   | 87564  | 2
google   | 374856 | 3
explorer | 84763  | 4
The same for df_2.

Example df_2:

adress
------
UK
USA
EUROPE
New df_2:

adress | new_id
-------|-------
UK     | 1
USA    | 2
EUROPE | 3
Now that I have a common column between the two dataframes, I can do a left join using new_id as the key.

My questions: is this solution efficient? And how can I add a new_id column with the line numbering to each dataframe?
Recommended answer
Since Spark uses lazy evaluation, execution does not start until an action is triggered. So what you can do is simply call the createDataFrame function on the Spark session and pass it the selected columns from df1 and df2. It will create the new dataframe you need.
For example: df3 = spark.createDataFrame([df1.select(''), df2.select('')])
If this works for you, please upvote.