Pyspark isin function
Question
I am a beginner in Spark. I am converting my legacy Python code to Spark using Pyspark.
I would like to get a Pyspark equivalent of the code below:
usersofinterest = actdataall[actdataall['ORDValue'].isin(orddata['ORDER_ID'].unique())]['User ID']
Both actdataall and orddata are Spark dataframes.
I don't want to use the toPandas() function, given the drawbacks associated with it.
Any help is appreciated.
Accepted answer
- If both dataframes are big, you should consider using an inner join, which will work as a filter:
First, let's create a dataframe containing the order IDs we want to keep:
orderid_df = orddata.select(orddata.ORDER_ID.alias("ORDValue")).distinct()
Now, let's join it with our actdataall dataframe:
usersofinterest = actdataall.join(orderid_df, "ORDValue", "inner").select('User ID').distinct()
- If your target list of order IDs is small, you can use the pyspark.sql isin function as mentioned in furianpandit's post. Don't forget to broadcast your variable before using it (Spark will copy the object to every node, making their tasks a lot faster):
orderid_list = orddata.select('ORDER_ID').distinct().rdd.flatMap(lambda x: x).collect()
orderid_list = sc.broadcast(orderid_list)