How to get the most common element for each array in a list (PySpark)
Question
I have a list of arrays, and I need to find the highest-frequency element of each array in the list. The following code throws an `unhashable type: 'list'` error. I have also tried parallelizing the results list, but the error remains.
from collections import Counter

# [array([0, 1, 1]), array([0, 0, 1]), array([1, 1, 0])]  example of the list
def finalml(listn):
    return Counter(listn).most_common(1)

# the list of arrays is returned by this
results = sn.rdd.map(lambda xw: bc_knnobj.value.kneighbors(xw, return_distance=False)).collect()
labels = results.map(lambda xw: finalml(xw)).collect()  # this is where the error is raised
Expected output: [1, 0, 1]
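For context, the error most likely occurs because `kneighbors` returns a 2-D array per point, so `Counter` ends up iterating over sub-arrays, which are unhashable. If the goal is the expected output above (the mode of each inner array), a minimal driver-side sketch would be (`rows` is an illustrative stand-in for the collected results):

```python
from collections import Counter

# Per-row mode: Counter is applied to the scalar values of one row at a
# time, so nothing unhashable is ever counted.
rows = [[0, 1, 1], [0, 0, 1], [1, 1, 0]]
modes = [Counter(r).most_common(1)[0][0] for r in rows]
print(modes)  # [1, 0, 1]
```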
Answer
Try this:
x = [[0,1,1],[0,0,1],[1,1,0]]
df = spark.createDataFrame(x)
df.show()
Input df:
+---+---+---+
| _1| _2| _3|
+---+---+---+
| 0| 1| 1|
| 0| 0| 1|
| 1| 1| 0|
+---+---+---+
from collections import Counter

import pyspark.sql.functions as F

@F.udf
def mode(x):
    # most frequent value among the collected column values
    return Counter(x).most_common(1)[0][0]

cols = df.columns
agg_expr = [mode(F.collect_list(col)).alias(col) for col in cols]
df.groupBy().agg(*agg_expr).show()
Output df:
+---+---+---+
| _1| _2| _3|
+---+---+---+
| 0| 1| 1|
+---+---+---+
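Note that this answer computes the mode of each *column* (so it yields `[0, 1, 1]` here, not the `[1, 0, 1]` in the question, which is the per-row mode). A pure-Python sketch of what the `collect_list` + `mode` aggregation does, using the same sample data:

```python
from collections import Counter

# Equivalent of the aggregation above: collect_list gathers each column's
# values, then mode() picks the most frequent one per column.
x = [[0, 1, 1], [0, 0, 1], [1, 1, 0]]
columns = list(zip(*x))  # transpose rows into columns
col_modes = [Counter(c).most_common(1)[0][0] for c in columns]
print(col_modes)  # [0, 1, 1]
```

One caveat: `Counter.most_common` breaks ties by first-seen order, so with an even split the result depends on the order in which values were collected.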