PySpark isin function
Problem description
I am converting my legacy Python code to Spark using PySpark.
I would like to get a PySpark equivalent of:
usersofinterest = actdataall[actdataall['ORDValue'].isin(orddata['ORDER_ID'].unique())]['User ID']
Both actdataall and orddata are Spark dataframes.
I don't want to use the toPandas() function, given the drawbacks associated with it.
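For reference, the pandas semantics being replaced can be sketched in plain Python. The data values below are made up purely for illustration:

```python
# Toy stand-ins for the original dataframes (assumed values, illustration only).
actdataall = [
    {"User ID": "u1", "ORDValue": 10},
    {"User ID": "u2", "ORDValue": 20},
    {"User ID": "u3", "ORDValue": 30},
]
orddata = [{"ORDER_ID": 10}, {"ORDER_ID": 10}, {"ORDER_ID": 30}]

# orddata['ORDER_ID'].unique() -> the distinct set of order IDs
order_ids = {row["ORDER_ID"] for row in orddata}

# .isin(...) keeps rows whose ORDValue appears in that set,
# and ['User ID'] projects the matching user IDs
usersofinterest = [row["User ID"] for row in actdataall
                   if row["ORDValue"] in order_ids]
```

The two PySpark approaches in the answer below reproduce exactly this filter-and-project logic at scale.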
Recommended answer
If both dataframes are big, you should consider using an inner join, which will work as a filter:
First, let's create a dataframe containing the order IDs we want to keep:
orderid_df = orddata.select(orddata.ORDER_ID.alias("ORDValue")).distinct()
Now let's join it with our actdataall dataframe:
usersofinterest = actdataall.join(orderid_df, "ORDValue", "inner").select('User ID').distinct()
If your target list of order IDs is small, you can use the pyspark.sql isin function as mentioned in furianpandit's post. Don't forget to broadcast your variable before using it (Spark will copy the object to every node, making their tasks a lot faster):
orderid_list = orddata.select('ORDER_ID').distinct().rdd.flatMap(lambda x: x).collect()
orderid_list = sc.broadcast(orderid_list)