pyspark; check if an element is in collect_list
Problem description
I am working on a dataframe df, for instance the following dataframe:
df.show()
Output:
+----+------+
|keys|values|
+----+------+
| aa| apple|
| bb|orange|
| bb| desk|
| bb|orange|
| bb| desk|
| aa| pen|
| bb|pencil|
| aa| chair|
+----+------+
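For context, here is a minimal sketch that recreates this example dataframe; it assumes an active SparkSession named spark, which is not shown in the original question:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumed setup, not part of the original post
data = [("aa", "apple"), ("bb", "orange"), ("bb", "desk"), ("bb", "orange"),
        ("bb", "desk"), ("aa", "pen"), ("bb", "pencil"), ("aa", "chair")]
df = spark.createDataFrame(data, ["keys", "values"])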
I use collect_set to aggregate and get a set of objects with duplicate elements eliminated (or collect_list to get a list of objects).
from pyspark.sql.functions import collect_set
df_new = df.groupby('keys').agg(collect_set(df.values).alias('collectedSet_values'))
The resulting dataframe is as follows:
df_new.show()
Output:
+----+----------------------+
|keys|collectedSet_values |
+----+----------------------+
|bb |[orange, pencil, desk]|
|aa |[apple, pen, chair] |
+----+----------------------+
I am struggling to find a way to see if a specific keyword (like 'chair') is in the resulting set of objects (in the column collectedSet_values). I do not want to go with a udf solution.
Please comment your solutions/ideas.
Kind regards.
Recommended answer
Actually, there is a nice function array_contains which does that for us. The way we use it for a set of objects is the same as shown here. To know if the word 'chair' exists in each set of objects, we can simply do the following:
from pyspark.sql.functions import array_contains
df_new.withColumn('contains_chair', array_contains(df_new.collectedSet_values, 'chair')).show()
Output:
+----+----------------------+--------------+
|keys|collectedSet_values |contains_chair|
+----+----------------------+--------------+
|bb |[orange, pencil, desk]|false |
|aa |[apple, pen, chair] |true |
+----+----------------------+--------------+
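As a follow-up, the same array_contains expression can also be used in a filter to keep only the groups whose collected set contains the keyword; this is a sketch, not part of the original answer:

from pyspark.sql.functions import array_contains
df_new.filter(array_contains(df_new.collectedSet_values, 'chair')).show()
# keeps only the 'aa' row, since its set contains 'chair'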
The same holds for the result of collect_list.
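For reference, a minimal sketch of the collect_list variant; the variable df_list and the column name collectedList_values are assumptions made for illustration:

from pyspark.sql.functions import array_contains, collect_list
df_list = df.groupby('keys').agg(collect_list(df.values).alias('collectedList_values'))
# duplicates are kept in the list, but array_contains works the same way
df_list.withColumn('contains_chair', array_contains(df_list.collectedList_values, 'chair')).show()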