pyspark dataframe filter or include based on list

Views: 26 | Published: 2021/11/12 5:36:55 | Tags: apache-spark, filter, pyspark, apache-spark-sql


Question

I am trying to filter a DataFrame in PySpark using a list. I want to either exclude the records whose value is in the list, or keep only the records whose value is in the list. My code below does not work:

# define a dataframe
rdd = sc.parallelize([(0,1), (0,1), (0,2), (1,2), (1,10), (1,20), (3,18), (3,18), (3,18)])
df = sqlContext.createDataFrame(rdd, ["id", "score"])

# define a list of scores
l = [10,18,20]

# filter out records by scores by list l
records = df.filter(df.score in l)
# expected: (0,1), (0,1), (0,2), (1,2)

# include only records with these scores in list l
records = df.where(df.score in l)
# expected: (1,10), (1,20), (3,18), (3,18), (3,18)

It gives the following error: ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.

Recommended answer

What the error says is that "df.score in l" cannot be evaluated: df.score gives you a Column, and Python's "in" operator does not work on a Column, because the comparison produces another Column expression that cannot be converted to a plain bool. Use isin() instead.

The code should look like this:

# define a dataframe
rdd = sc.parallelize([(0,1), (0,1), (0,2), (1,2), (1,10), (1,20), (3,18), (3,18), (3,18)])
df = sqlContext.createDataFrame(rdd, ["id", "score"])

# define a list of scores
l = [10,18,20]

# exclude records whose score is in list l
records = df.filter(~df.score.isin(l))
# expected: (0,1), (0,1), (0,2), (1,2)

# include only records whose score is in list l
records = df.filter(df.score.isin(l))
# expected: (1,10), (1,20), (3,18), (3,18), (3,18)

Note that where() is an alias for filter(), so the two are interchangeable.
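To see why "df.score in l" raises the error while isin() works, here is a rough sketch that runs without Spark. ToyColumn is a hypothetical stand-in written for this illustration, not PySpark's real Column class; it only mimics the one behavior that matters here, assuming that comparisons return a lazy expression object whose truth value is undefined.

```python
class ToyColumn:
    """Hypothetical stand-in for a lazy column expression (not PySpark's Column)."""

    def __init__(self, expr):
        self.expr = expr

    def __eq__(self, other):
        # A comparison builds a new expression instead of returning True/False.
        return ToyColumn("({} = {!r})".format(self.expr, other))

    def __bool__(self):
        # A lazy expression has no single truth value, so bool() must fail.
        raise ValueError("Cannot convert column into bool")

    def isin(self, values):
        # Membership is expressed as another lazy expression.
        joined = ", ".join(repr(v) for v in values)
        return ToyColumn("({} IN ({}))".format(self.expr, joined))

    def __invert__(self):
        # '~' negates the expression, still without evaluating it.
        return ToyColumn("(NOT {})".format(self.expr))


score = ToyColumn("score")
l = [10, 18, 20]

try:
    # Python compares score with each list element using ==,
    # then calls bool() on the result, which raises.
    score in l
    in_error = None
except ValueError as exc:
    in_error = str(exc)

include_expr = score.isin(l).expr      # expression for "keep rows in l"
exclude_expr = (~score.isin(l)).expr   # expression for "drop rows in l"

print(in_error)
print(include_expr)
print(exclude_expr)
```

The point of the sketch is that "in" forces an immediate True/False answer, while isin() and "~" only describe the condition, which is what filter() and where() need.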
