Spark efficiently filtering entries from big dataframe that exist in a small dataframe

Question

I have a Spark program that reads a relatively big dataframe (~3.2 TB) with two columns, id and name, and another relatively small dataframe (~20k entries) with a single column, id.

What I'm trying to do is take both the id and the name from the big dataframe for the ids that appear in the small dataframe.

I was wondering what an efficient solution would be, and why. Several options I had in mind:

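  1. Broadcast join of the two dataframes
  2. Broadcast the small dataframe, collect it as an array of strings, then filter the big dataframe using isin with that array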
Are there any other options that I didn't mention here?

I'd appreciate it if someone could also explain why one solution is more efficient than another.

Thanks in advance

Recommended answer

AFAIK it all depends on the size of the data you are handling and the performance you need.

  • If you use a broadcast join, the default size threshold for your small dataframe is 10 MB, set via spark.sql.autoBroadcastJoinThreshold (see my answer); you can increase or decrease it based on your data. Also, the broadcast data becomes part of the SQL execution plan, which gives the Catalyst optimizer a pointer for further optimization. Also see my answer here.
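As a rough sanity check on the threshold mentioned above, the arithmetic below estimates whether about 20k ids would fit under the 10 MB default. The 36-byte id size (a UUID-like string) is an assumption for illustration, not a figure from the question, and Spark's own size estimate comes from plan statistics and may differ from this raw payload count.

```python
# Default value of spark.sql.autoBroadcastJoinThreshold, in bytes (10 MB).
DEFAULT_THRESHOLD_BYTES = 10 * 1024 * 1024

# From the question: the small dataframe has about 20k entries.
NUM_IDS = 20_000

# Assumption for illustration only: each id is a 36-character ASCII string.
ASSUMED_BYTES_PER_ID = 36


def estimated_size_bytes(num_ids: int, bytes_per_id: int) -> int:
    """Raw payload size of the id column, ignoring per-row overhead."""
    return num_ids * bytes_per_id


small_side_bytes = estimated_size_bytes(NUM_IDS, ASSUMED_BYTES_PER_ID)
fits_under_default = small_side_bytes <= DEFAULT_THRESHOLD_BYTES

print(f"estimated small side: {small_side_bytes} bytes")
print(f"default threshold:    {DEFAULT_THRESHOLD_BYTES} bytes")
print(f"fits under default:   {fits_under_default}")
```

Under this assumption the small side is well below the default, so raising the threshold would probably only matter if the ids were much larger or the small dataframe grew considerably.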

Whereas a broadcast shared variable (which you would use with isin) does not have the above advantages.
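To make the comparison concrete, here is a minimal plain-Python model (not Spark code) of what filtering against a broadcast id set amounts to at execution time: the small id set is built once, handed to every partition of the big dataframe, and each row is kept or dropped by a local hash lookup, so the big side never has to be shuffled. The partition layout, sample rows, and function names are hypothetical. In Spark itself the first option would typically be written with `broadcast()` from `pyspark.sql.functions` and a join on id, and the second with `Column.isin`; the answer's point is that the join form is part of the SQL execution plan.

```python
from typing import Iterable, Iterator, List, Set, Tuple

Row = Tuple[str, str]  # (id, name), mirroring the big dataframe's columns


def filter_partition(rows: Iterable[Row], broadcast_ids: Set[str]) -> Iterator[Row]:
    """Keep rows whose id is in the broadcast set; one hash lookup per row."""
    for row_id, name in rows:
        if row_id in broadcast_ids:
            yield (row_id, name)


def broadcast_semi_join(partitions: List[List[Row]], small_ids: Iterable[str]) -> List[Row]:
    """Model of a broadcast hash semi-join over pre-partitioned data."""
    # Build the hash set once (on the driver, in Spark terms).
    broadcast_ids = set(small_ids)
    result: List[Row] = []
    # Each partition is filtered independently; rows never move between partitions.
    for partition in partitions:
        result.extend(filter_partition(partition, broadcast_ids))
    return result


# Hypothetical sample data standing in for the ~3.2 TB dataframe.
big_partitions = [
    [("1", "alice"), ("2", "bob")],
    [("3", "carol"), ("4", "dave")],
]
small_ids = ["2", "3", "99"]

matched = broadcast_semi_join(big_partitions, small_ids)
print(matched)
```

In this model the cost is a single pass over the big side plus one small set per partition, which is why either option avoids the expensive shuffle a regular join would need.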

Please see my answer in the link above, in my comment.
