Spark: efficiently filtering entries from a big dataframe that exist in a small dataframe

Posted: 2021/11/14 22:59:02 · Tags: performance, join, apache-spark, apache-spark-sql


Problem description

I have a Spark program that reads a relatively big dataframe (~3.2 TB) with 2 columns, id and name, and another relatively small dataframe (~20k entries) with a single column, id.

What I'm trying to do is take both the id and the name from the big dataframe for the rows whose id appears in the small dataframe.

I was wondering what an efficient solution would be, and why. Several options I had in mind:

Broadcast join the 2 dataframes
Broadcast the small dataframe, collect it as an array of strings, then filter the big dataframe using isin with that array

Are there any other options that I didn't mention here?

I'd appreciate it if someone could also explain why one specific solution is more efficient than the other.

Thanks in advance.
