Pyspark filter dataframe by columns of another dataframe
Question
Not sure why I'm having a difficult time with this; it seems so simple considering it's fairly easy to do in R or pandas. I wanted to avoid using pandas, though, since I'm dealing with a lot of data, and I believe toPandas() loads all the data into the driver's memory in pyspark.
I have 2 dataframes: df1 and df2. I want to filter df1 (remove all rows) where df1.userid = df2.userid AND df1.group = df2.group. I wasn't sure if I should use filter(), join(), or sql. For example:
df1:
+------+-----+------------------+
|userid|group|         all_picks|
+------+-----+------------------+
|   348|    2| [225, 2235, 2225]|
|   567|    1|      [1110, 1150]|
|   595|    1|[1150, 1150, 1150]|
|   580|    2|      [2240, 2225]|
|   448|    1|            [1130]|
+------+-----+------------------+
df2:
+------+-----+----+
|userid|group|pick|
+------+-----+----+
|   348|    2|2270|
|   595|    1|2125|
+------+-----+----+
Result I want:
+------+-----+------------+
|userid|group|   all_picks|
+------+-----+------------+
|   567|    1|[1110, 1150]|
|   580|    2|[2240, 2225]|
|   448|    1|      [1130]|
+------+-----+------------+
I've tried many join() and filter() functions; I believe the closest I got was:
cond = [df1.userid == df2.userid, df2.group == df2.group]
df1.join(df2, cond, 'left_outer').select(df1.userid, df1.group, df1.all_picks) # Result has 7 rows
I tried a bunch of different join types, and I also tried different cond values:
cond = ((df1.userid == df2.userid) & (df2.group == df2.group)) # result has 7 rows
cond = ((df1.userid != df2.userid) & (df2.group != df2.group)) # result has 2 rows
However, it seems like the joins are adding additional rows rather than deleting them.
I'm using python 2.7 and spark 2.1.0.
Answer
Left anti join is what you're looking for:
df1.join(df2, ["userid", "group"], "leftanti")
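To make that concrete, here is a minimal runnable sketch using the sample data from the question (the local SparkSession setup is an assumption for illustration, not part of the original answer):

from pyspark.sql import SparkSession

# Assumed setup for illustration: a local SparkSession
spark = SparkSession.builder.master("local[*]").appName("leftanti-example").getOrCreate()

df1 = spark.createDataFrame(
    [(348, 2, [225, 2235, 2225]),
     (567, 1, [1110, 1150]),
     (595, 1, [1150, 1150, 1150]),
     (580, 2, [2240, 2225]),
     (448, 1, [1130])],
    ["userid", "group", "all_picks"])

df2 = spark.createDataFrame(
    [(348, 2, 2270), (595, 1, 2125)],
    ["userid", "group", "pick"])

# Keep only the rows of df1 whose (userid, group) pair has no match in df2
df1.join(df2, ["userid", "group"], "leftanti").show()
# Expected rows (order may vary): (567, 1), (580, 2), (448, 1)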
but the same thing can be done with a left outer join:
(df1
    .join(df2, ["userid", "group"], "leftouter")
    .where(df2["pick"].isNull())
    .drop(df2["pick"]))
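The left outer variant works because rows of df1 without a match in df2 come back with NULL in df2's pick column, so filtering on isNull() keeps exactly the unmatched rows. Since the question also mentions sql, the same anti join can be written in Spark SQL; a sketch under the assumptions above (the temp view names are illustrative, and backticks quote the group column since GROUP is a SQL keyword):

# Assumed: df1 and df2 from the sketch above, registered as temp views
df1.createOrReplaceTempView("df1")
df2.createOrReplaceTempView("df2")

result = spark.sql("""
    SELECT df1.*
    FROM df1 LEFT ANTI JOIN df2
      ON df1.userid = df2.userid AND df1.`group` = df2.`group`
""")
result.show()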