Filter a large number of IDs from a Spark DataFrame


Question

I have a large dataframe with a format similar to

+-----+------+------+
|ID   |Cat   |date  |
+-----+------+------+
|12   | A    |201602|
|14   | B    |201601|
|19   | A    |201608|
|12   | F    |201605|
|11   | G    |201603|
+-----+------+------+

and I need to filter rows based on a list of around 500,000 IDs. The straightforward way would be to filter with isin, but that has really bad performance. How can this filter be done?

Answer

If you're committed to using Spark SQL and isin doesn't scale anymore, then an inner equi-join should be a decent fit.

First convert the id list to a single-column DataFrame. If this is a local collection:

# Build a one-column DataFrame from the local list; name the column "ID"
# so it matches the join key below (safe even with spark.sql.caseSensitive on)
ids_df = sc.parallelize(id_list).map(lambda x: (x, )).toDF(["ID"])

then join:

df.join(ids_df, ["ID"], "inner")
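Conceptually, what the join buys you over a literal isin predicate is a hash lookup: Spark can broadcast the small ID table and probe it once per row, instead of evaluating a 500,000-element IN-list. A minimal plain-Python sketch of that difference (the sample data and the names rows, id_list are invented for illustration, mirroring the table above):

```python
# Invented sample data mirroring the question's table
rows = [
    (12, "A", "201602"),
    (14, "B", "201601"),
    (19, "A", "201608"),
    (12, "F", "201605"),
    (11, "G", "201603"),
]
id_list = [12, 19]

# isin-style: membership test against the raw list is O(len(id_list)) per row
slow = [r for r in rows if r[0] in id_list]

# join-style: build the hash table once, then O(1) probes per row
id_set = set(id_list)
fast = [r for r in rows if r[0] in id_set]

assert slow == fast  # same result, very different cost at 500,000 IDs
print(fast)
```

In Spark itself, a list of roughly 500,000 IDs is typically small enough that the optimizer will pick a broadcast hash join for ids_df, which is exactly this pattern: ship the small side to every executor and hash-probe it while scanning the large DataFrame.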
