PySpark isin function

Problem description

I am converting my legacy Python code to Spark using PySpark.

I would like to get a PySpark equivalent of:

usersofinterest = actdataall[actdataall['ORDValue'].isin(orddata['ORDER_ID'].unique())]['User ID']

Both actdataall and orddata are Spark dataframes.
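
For concreteness, here is a minimal sketch of what such dataframes could look like (the schemas and values are hypothetical, inferred from the column names above):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext  # used later when broadcasting

# Hypothetical sample data matching the column names above
actdataall = spark.createDataFrame(
    [(1, 'alice'), (2, 'bob'), (3, 'carol'), (4, 'alice')],
    ['ORDValue', 'User ID'])
orddata = spark.createDataFrame(
    [(1,), (3,)], ['ORDER_ID'])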

I don't want to use the toPandas() function, given the drawbacks associated with it.

Recommended answer

  • If both dataframes are big, you should consider using an inner join, which will work as a filter:

    First, let's create a dataframe containing the order IDs we want to keep:

      orderid_df = orddata.select(orddata.ORDER_ID.alias("ORDValue")).distinct()
      

    Now let's join it with our actdataall dataframe:

      usersofinterest = actdataall.join(orderid_df, "ORDValue", "inner").select('User ID').distinct()
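
    If orderid_df is much smaller than actdataall, one further option (not covered in the original answer) is an explicit broadcast hint, so Spark executes this as a map-side broadcast join instead of a shuffle:

      from pyspark.sql.functions import broadcast

      # Hint that orderid_df is small enough to ship to every executor,
      # turning the shuffle join into a cheap map-side lookup
      usersofinterest = actdataall.join(broadcast(orderid_df), "ORDValue", "inner").select('User ID').distinct()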
      

  • If your target list of order IDs is small, then you can use the pyspark.sql isin function, as mentioned in furianpandit's post. Don't forget to broadcast your variable before using it (Spark will copy the object to every node, making their tasks a lot faster):

      # Collect the distinct order IDs into a Python list on the driver
      orderid_list = orddata.select('ORDER_ID').distinct().rdd.flatMap(lambda x: x).collect()
      bc_orderid_list = sc.broadcast(orderid_list)
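
    The answer stops short of the final filter, so here is a sketch of how the broadcast list could then be plugged into isin, mirroring the original pandas one-liner (assuming the bc_orderid_list broadcast variable defined above):

      # Keep only rows whose ORDValue appears in the broadcast list,
      # then project the distinct user IDs
      usersofinterest = actdataall.filter(
          actdataall['ORDValue'].isin(bc_orderid_list.value)).select('User ID').distinct()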
      
