pyspark; check if an element is in collect_list

Question

I am working with a dataframe df, for instance the following one:

df.show()

Output:

+----+------+
|keys|values|
+----+------+
|  aa| apple|
|  bb|orange|
|  bb|  desk|
|  bb|orange|
|  bb|  desk|
|  aa|   pen|
|  bb|pencil|
|  aa| chair|
+----+------+

I use collect_set to aggregate and get a set of objects with duplicate elements eliminated (or collect_list to get a list of objects that keeps duplicates).

from pyspark.sql.functions import collect_set
df_new = df.groupby('keys').agg(collect_set(df.values).alias('collectedSet_values'))

The resulting dataframe is as follows:

df_new.show()

Output:

+----+----------------------+
|keys|collectedSet_values   |
+----+----------------------+
|bb  |[orange, pencil, desk]|
|aa  |[apple, pen, chair]   |
+----+----------------------+

I am struggling to find a way to check whether a specific keyword (like 'chair') is in the resulting set of objects (in the column collectedSet_values). I do not want to use a udf solution.

Please comment with your solutions/ideas.

Kind regards.

Recommended answer

Actually, there is a nice function, array_contains, that does this for us. The way we use it for a set of objects is the same as in here. To check whether the word 'chair' exists in each set of objects, we can simply do the following:

from pyspark.sql.functions import array_contains
df_new.withColumn('contains_chair', array_contains(df_new.collectedSet_values, 'chair')).show()

Output:

+----+----------------------+--------------+
|keys|collectedSet_values   |contains_chair|
+----+----------------------+--------------+
|bb  |[orange, pencil, desk]|false         |
|aa  |[apple, pen, chair]   |true          |
+----+----------------------+--------------+

The same holds for the result of collect_list.
