pyspark dataframe filter or include based on list
Question
I am trying to filter a dataframe in pyspark using a list. I want to either exclude the records whose value is in the list or include only the records whose value is in the list. My code below does not work:
# define a dataframe
rdd = sc.parallelize([(0,1), (0,1), (0,2), (1,2), (1,10), (1,20), (3,18), (3,18), (3,18)])
df = sqlContext.createDataFrame(rdd, ["id", "score"])
# define a list of scores
l = [10,18,20]
# exclude records whose score is in list l
records = df.filter(df.score in l)
# expected: (0,1), (0,1), (0,2), (1,2)
# include only records with these scores in list l
records = df.where(df.score in l)
# expected: (1,10), (1,20), (3,18), (3,18), (3,18)
This gives the following error: ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
Recommended answer
The error says that "df.score in l" cannot be evaluated. df.score gives you a Column, and Python's "in" operator needs a plain True or False, which a Column cannot be converted to. Use the Column method "isin" instead, and negate it with "~" to exclude the listed values.
The code should look like this:
# define a dataframe
rdd = sc.parallelize([(0,1), (0,1), (0,2), (1,2), (1,10), (1,20), (3,18), (3,18), (3,18)])
df = sqlContext.createDataFrame(rdd, ["id", "score"])
# define a list of scores
l = [10,18,20]
# exclude records whose score is in list l
records = df.filter(~df.score.isin(l))
# expected: (0,1), (0,1), (0,2), (1,2)
# include only records with these scores in list l
records = df.where(df.score.isin(l))
# expected: (1,10), (1,20), (3,18), (3,18), (3,18)
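To see why "in" fails while "isin" works, here is a minimal sketch in plain Python that needs no Spark installation. FakeColumn is a hypothetical stand-in written for this illustration and is not part of PySpark. It only mimics two behaviours of a real Column: "==" builds a new expression instead of returning True or False, and converting the object to bool raises ValueError.

```python
class FakeColumn:
    """Hypothetical stand-in for a Spark Column (illustration only, not PySpark)."""

    def __init__(self, expr):
        self.expr = expr

    def __eq__(self, other):
        # Comparison builds a new lazy expression rather than a bool.
        return FakeColumn(f"({self.expr} = {other!r})")

    def __bool__(self):
        # A lazy expression has no single truth value.
        raise ValueError("Cannot convert column into bool")

    def isin(self, values):
        # Builds one expression that describes the whole membership test.
        listed = ", ".join(repr(v) for v in values)
        return FakeColumn(f"({self.expr} IN ({listed}))")


score = FakeColumn("score")
l = [10, 18, 20]

try:
    # list.__contains__ compares each element with == and then
    # needs a bool from the result, which triggers __bool__.
    score in l
    outcome = "no error"
except ValueError as exc:
    outcome = str(exc)

print(outcome)             # Cannot convert column into bool
print(score.isin(l).expr)  # (score IN (10, 18, 20))
```

The same reasoning explains why the error message suggests '&', '|' and '~': Python's "and", "or" and "not" also require a bool, whereas the operator forms can be overloaded to build an expression.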