pyspark; check if an element is in collect_list
Question
I am working on a dataframe df, for instance the following dataframe:
df.show()
Output:
+----+------+
|keys|values|
+----+------+
| aa| apple|
| bb|orange|
| bb| desk|
| bb|orange|
| bb| desk|
| aa| pen|
| bb|pencil|
| aa| chair|
+----+------+
I use collect_set to aggregate and get a set of objects with duplicate elements eliminated (or collect_list to get a list of objects).
from pyspark.sql.functions import collect_set

df_new = df.groupby('keys').agg(collect_set(df.values).alias('collectedSet_values'))
The resulting dataframe is as follows:
df_new.show()
Output:
+----+----------------------+
|keys|collectedSet_values |
+----+----------------------+
|bb |[orange, pencil, desk]|
|aa |[apple, pen, chair] |
+----+----------------------+
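As a rough plain-Python analogy (illustration only, not Spark code, and using hypothetical names like rows and collected), collect_set per key behaves like building one deduplicated set of values per group:

```python
from collections import defaultdict

# Plain-Python sketch of groupby('keys') + collect_set: one set per key,
# duplicates dropped. This only mimics the semantics, not the execution.
rows = [
    ("aa", "apple"), ("bb", "orange"), ("bb", "desk"),
    ("bb", "orange"), ("bb", "desk"), ("aa", "pen"),
    ("bb", "pencil"), ("aa", "chair"),
]

collected = defaultdict(set)
for key, value in rows:
    collected[key].add(value)

# collected["bb"] == {"orange", "pencil", "desk"}
# collected["aa"] == {"apple", "pen", "chair"}
```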
I am struggling to find a way to see if a specific keyword (like 'chair') is in the resulting set of objects (in the collectedSet_values column). I do not want to go with a udf solution.

Please comment your solutions/ideas.
Kind regards.
Answer
Actually there is a nice function array_contains which does that for us. The way we use it for a set of objects is the same as shown here. To know if the word 'chair' exists in each set of objects, we can simply do the following:
from pyspark.sql.functions import array_contains

df_new.withColumn('contains_chair', array_contains(df_new.collectedSet_values, 'chair')).show()
Output:
+----+----------------------+--------------+
|keys|collectedSet_values |contains_chair|
+----+----------------------+--------------+
|bb |[orange, pencil, desk]|false |
|aa |[apple, pen, chair] |true |
+----+----------------------+--------------+
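In plain-Python terms (a sketch only, with a hypothetical collected dict standing in for the grouped column), array_contains is just a per-row membership test against each collected array:

```python
# Sketch of what array_contains computes per row: is 'chair' present
# in each key's collected array of values?
collected = {
    "bb": ["orange", "pencil", "desk"],
    "aa": ["apple", "pen", "chair"],
}
contains_chair = {key: "chair" in values for key, values in collected.items()}
# contains_chair == {"bb": False, "aa": True}
```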
The same applies to the result of collect_list.
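A small sketch of why this holds (illustration only, with made-up sample values): collect_list keeps duplicates while collect_set drops them, but a membership check answers the same way on either.

```python
# Hypothetical values for key 'bb': list-style keeps duplicates,
# set-style deduplicates; membership tests agree on both.
values_bb = ["orange", "desk", "orange", "desk", "pencil"]  # collect_list-style
unique_bb = set(values_bb)                                   # collect_set-style

assert ("chair" in values_bb) == ("chair" in unique_bb)
assert ("orange" in values_bb) == ("orange" in unique_bb)
```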