Pyspark filter dataframe by columns of another dataframe

Problem description

Not sure why I'm having a difficult time with this; it seems so simple considering it's fairly easy to do in R or pandas. I wanted to avoid using pandas, though, since I'm dealing with a lot of data, and I believe toPandas() loads all the data into the driver's memory in pyspark.

I have 2 dataframes: df1 and df2. I want to filter df1 (remove all rows) where df1.userid = df2.userid AND df1.group = df2.group. I wasn't sure if I should use filter(), join(), or SQL. For example:

df1:
+------+----------+--------------------+
|userid|   group  |      all_picks     |
+------+----------+--------------------+
|   348|         2|[225, 2235, 2225]   |
|   567|         1|[1110, 1150]        |
|   595|         1|[1150, 1150, 1150]  |
|   580|         2|[2240, 2225]        |
|   448|         1|[1130]              |
+------+----------+--------------------+

df2:
+------+----------+---------+
|userid|   group  |   pick  |
+------+----------+---------+
|   348|         2|     2270|
|   595|         1|     2125|
+------+----------+---------+

Result I want:
+------+----------+--------------------+
|userid|   group  |      all_picks     |
+------+----------+--------------------+
|   567|         1|[1110, 1150]        |
|   580|         2|[2240, 2225]        |
|   448|         1|[1130]              |
+------+----------+--------------------+

I've tried many join() and filter() functions; I believe the closest I got was:

cond = [df1.userid == df2.userid, df2.group == df2.group]
df1.join(df2, cond, 'left_outer').select(df1.userid, df1.group, df1.all_picks) # Result has 7 rows

I tried a bunch of different join types, and I also tried different cond values:

cond = ((df1.userid == df2.userid) & (df2.group == df2.group))  # result has 7 rows
cond = ((df1.userid != df2.userid) & (df2.group != df2.group))  # result has 2 rows

However, it seems like the joins are adding additional rows rather than deleting them.

I'm using Python 2.7 and Spark 2.1.0.

Answer

Left anti join is what you're looking for:

df1.join(df2, ["userid", "group"], "leftanti")
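
For completeness, here's a minimal, self-contained sketch of the anti-join against the sample data above (the SparkSession setup and variable names are illustrative, not from the original post):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

df1 = spark.createDataFrame(
    [(348, 2, [225, 2235, 2225]),
     (567, 1, [1110, 1150]),
     (595, 1, [1150, 1150, 1150]),
     (580, 2, [2240, 2225]),
     (448, 1, [1130])],
    ["userid", "group", "all_picks"])

df2 = spark.createDataFrame(
    [(348, 2, 2270),
     (595, 1, 2125)],
    ["userid", "group", "pick"])

# "leftanti" keeps only the df1 rows whose (userid, group) pair has no match in df2
df1.join(df2, ["userid", "group"], "leftanti").show()
# leaves the rows for userids 567, 580, and 448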

but the same thing can be done with left outer join:

(df1
    .join(df2, ["userid", "group"], "leftouter")
    .where(df2["pick"].isNull())
    .drop(df2["pick"]))
