pyspark: isin vs join


Problem Description


What are general best-practices to filtering a dataframe in pyspark by a given list of values? Specifically:


Depending on the size of the given list of values, then with respect to runtime when is it best to use isin vs inner join vs broadcast?


This question is the spark analogue of the following question in Pig:

Pig: efficient filtering by a loaded list

Additional context:

Pyspark isin function

Recommended Answer

Consider

import pyspark.sql.functions as psf

There are two types of broadcasting:

  • sc.broadcast() to copy python objects to every node for a more efficient use of psf.isin
  • psf.broadcast inside a join to copy your pyspark dataframe to every node when the dataframe is small: df1.join(psf.broadcast(df2)). It is usually used for cartesian products (CROSS JOIN in Pig).


In the context question, the filtering was done using the column of another dataframe, hence the possible solution with a join.


Keep in mind that if your filtering list is relatively big, the operation of searching through it will take a while, and since it has to be done for each row it can quickly get costly.


Joins, on the other hand, involve two dataframes that will be sorted before matching, so if your list is small enough you may not want to sort a huge dataframe just for a filter.

