Pyspark isin function


Problem Description

I am a beginner in Spark. I am converting my legacy Python code to Spark using Pyspark.

I would like to get a Pyspark equivalent of the code below

usersofinterest = actdataall[actdataall['ORDValue'].isin(orddata['ORDER_ID'].unique())]['User ID']

Both actdataall and orddata are Spark dataframes.

I don't want to use the toPandas() function, given the drawbacks associated with it.

Any help is appreciated.

Recommended Answer

    • If both dataframes are big, you should consider using an inner join, which will work as a filter:

      First let's create a dataframe containing the order IDs we want to keep:

      orderid_df = orddata.select(orddata.ORDER_ID.alias("ORDValue")).distinct()
      

      Now let's join it with our actdataall dataframe:

      usersofinterest = actdataall.join(orderid_df, "ORDValue", "inner").select('User ID').distinct()
      

    • If your target list of order IDs is small, then you can use the pyspark.sql isin function as mentioned in furianpandit's post, as sketched below. Don't forget to broadcast your variable before using it (Spark will copy the object to every node, making their tasks a lot faster):

      # Collect the distinct order IDs into a Python list on the driver
      orderid_list = orddata.select('ORDER_ID').distinct().rdd.flatMap(lambda x: x).collect()
      # Broadcast the list so each executor receives one read-only copy
      orderid_broadcast = sc.broadcast(orderid_list)
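
      With the list in hand, the filter itself is a minimal sketch mirroring the legacy pandas line from the question (ORDValue and User ID are the column names taken from the question; orderid_broadcast is the broadcast variable created above):

      # Keep rows whose ORDValue appears in the broadcast list of order IDs,
      # then select the distinct user IDs
      usersofinterest = actdataall.filter(
          actdataall['ORDValue'].isin(orderid_broadcast.value)
      ).select('User ID').distinct()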
      
