pyspark: isin vs join
Question
What are general best practices for filtering a dataframe in pyspark by a given list of values? Specifically:
Depending on the size of the given list of values, with respect to runtime, when is it best to use isin vs an inner join vs broadcast?
This question is the Spark analogue of the following question in Pig:
Additional context:
Answer
Consider
import pyspark.sql.functions as psf
There are two types of broadcast:

- sc.broadcast() to copy Python objects to every node for a more efficient use of psf.isin
- psf.broadcast inside a join to copy your pyspark dataframe to every node when the dataframe is small: df1.join(psf.broadcast(df2)). It is usually used for cartesian products (CROSS JOIN in Pig).
In the question above, the filtering was done using a column of another dataframe, hence a join was a possible solution.
Keep in mind that if your filtering list is relatively big, the operation of searching through it will take a while, and since it has to be done for each row it can quickly get costly.
Joins, on the other hand, involve two dataframes that will be sorted before matching, so if your list is small enough you might not want to sort a huge dataframe just for a filter.