Spark efficiently filtering entries from big dataframe that exist in a small dataframe

Views: 39 | Posted: 2021/2/12 19:41:46 | Tags: performance, join, apache-spark, apache-spark-sql

Problem description

I have a Spark program that reads a relatively big dataframe (~3.2 terabytes) with 2 columns, id and name, and another relatively small dataframe (~20k entries) with a single column, id.

What I'm trying to do is take both the id and the name from the big dataframe for every id that also appears in the small dataframe.

I was wondering what would be an efficient solution to get this working, and why. Several options I had in mind:

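Broadcast join the 2 dataframes
Broadcast the small dataframe, collect it as an array of strings, and then filter the big dataframe using isin with that array of strings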

Are there any other options that I didn't mention here?

I'd appreciate it if someone could also explain why a specific solution is more efficient than the other.

Thanks in advance
