Pyspark Dataframe get unique elements from column with string as list of elements


Question

I have a dataframe (created by loading from multiple blobs in Azure) with a column that is a list of IDs. Now, I want a list of the unique IDs from this entire column:

Here is an example -

df - 
| col1 | col2 | col3  |
| "a"  | "b"  |"[q,r]"|
| "c"  | "f"  |"[s,r]"|

Here is my expected response:

resp = [q, r, s]

Any idea how to get there?

My current approach is to convert the strings in col3 to Python lists and then maybe flatten them out somehow.

But so far I have not been able to do so. I tried using user-defined functions in pyspark, but they only return strings, not lists.
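As an aside, a PySpark UDF can return a list if it is declared with an explicit `ArrayType` return type. The string parsing itself can be sketched in plain Python (the function name `parse_id_list` is illustrative, not from the original post):

```python
# A minimal sketch of parsing a string like "[q,r]" into a Python list.
# In PySpark this could be wrapped as a UDF with an array return type,
# e.g. udf(parse_id_list, ArrayType(StringType())), so the UDF yields
# a list column rather than a plain string column.
def parse_id_list(s):
    """Turn a string like '[q,r]' into ['q', 'r']."""
    inner = s.strip('[]')          # drop the surrounding brackets
    return inner.split(',') if inner else []

print(parse_id_list('[q,r]'))      # ['q', 'r']
```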

flatMap only works on RDDs, not on DataFrames, so it is out of the picture.

Maybe there is a way to specify this during the conversion from RDD to DataFrame, but I am not sure how to do that.

Recommended answer

Here is a method using only DataFrame functions:

from pyspark.sql import SparkSession
from pyspark.sql import functions as f

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([('a','b','[q,r,p]'),('c','f','[s,r]')],['col1','col2','col3'])

# Strip the brackets with regexp_extract, then split on commas to get an array column
df = df.withColumn('col4', f.split(f.regexp_extract('col3', r'\[(.*)\]', 1), ','))

# Explode the arrays into one row per ID; grouping shows each unique ID and its count
df.select(f.explode('col4').alias('exploded')).groupby('exploded').count().show()
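If the goal is the flat list of unique IDs rather than counts, a plain-Python mirror of the same pipeline shows what the DataFrame operations compute (this sketch stands in for regexp_extract, split, explode, and distinct):

```python
import re

# Plain-Python mirror of the DataFrame pipeline:
# the regex strips the brackets, split breaks on commas (like f.split),
# extend flattens the lists (like f.explode), and set() deduplicates
# (like DataFrame.distinct).
rows = ['[q,r,p]', '[s,r]']

exploded = []
for s in rows:
    inner = re.search(r'\[(.*)\]', s).group(1)   # 'q,r,p' / 's,r'
    exploded.extend(inner.split(','))

unique_ids = sorted(set(exploded))
print(unique_ids)   # ['p', 'q', 'r', 's']
```

In PySpark itself, the exploded column can be deduplicated and pulled back to the driver with the standard `distinct()` and `collect()` DataFrame methods.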

