pyspark: isin vs join
Problem description
What are general best practices for filtering a dataframe in pyspark by a given list of values? Specifically: depending on the size of the given list of values, when is it best, with respect to runtime, to use isin vs an inner join vs broadcast?
This question is the Spark analogue of the following question in Pig:

Additional context:
Recommended answer
Consider
import pyspark.sql.functions as psf
There are two types of broadcasting:

- sc.broadcast() to copy a python object to every node, for a more efficient use of psf.isin
- psf.broadcast inside a join to copy your pyspark dataframe to every node when the dataframe is small: df1.join(psf.broadcast(df2)). It is usually used for cartesian products (CROSS JOIN in Pig).
In the context question, the filtering was done using a column of another dataframe, hence the possible solution with a join.
Keep in mind that if your filtering list is relatively big, searching through it will take a while, and since this has to be done for each row it can quickly get costly.
Joins, on the other hand, involve two dataframes that will be sorted before matching, so if your list is small enough you might not want to sort a huge dataframe just for a filter.