A list as a key for PySpark's reduceByKey
Question
I am attempting to call the reduceByKey function of pyspark on data of the format ([a,b,c], 1), ([a,b,c], 1), ([a,d,b,e], 1), ...
It seems pyspark will not accept an array as the key in a normal key/value reduction by simply applying .reduceByKey(add).
I have already tried first converting the array to a string with .map(lambda (x, y): (str(x), y)), but this does not work because post-processing the strings back into arrays is too slow.
Is there a way I can make pyspark use the array as a key, or use another function to quickly convert the strings back to arrays?
Here is the associated error:
File "/home/jan/Documents/spark-1.4.0/python/lib/pyspark.zip/pyspark/shuffle.py", line 268, in mergeValues
d[k] = comb(d[k], v) if k in d else creator(v)
TypeError: unhashable type: 'list'
Summary
Input: x = [([a,b,c], 1), ([a,b,c], 1), ([a,d,b,e], 1), ...]
Desired output: y = [([a,b,c], 2), ([a,d,b,e], 1), ...]
so that I can access a via y[0][0][0] and 2 via y[0][1].
Answer
Try this:
rdd.map(lambda (k, v): (tuple(k), v)).groupByKey()
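If what you need is the summed counts from the question rather than grouped values, the same tuple conversion works with reduceByKey as well; a minimal sketch (using a kv-style lambda, which unlike the tuple-unpacking form above is valid in both Python 2 and 3):
from operator import add

# convert each list key to a hashable tuple, then sum the counts
counts = rdd.map(lambda kv: (tuple(kv[0]), kv[1])).reduceByKey(add)
y = counts.collect()  # e.g. [(('a', 'b', 'c'), 2), (('a', 'd', 'b', 'e'), 1)]
# order is not guaranteed; a is reachable via y[0][0][0] and 2 via y[0][1],
# matching the summary above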
Since Python lists are mutable, they cannot be hashed (they don't provide a __hash__ method):
>>> a_list = [1, 2, 3]
>>> a_list.__hash__ is None
True
>>> hash(a_list)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'list'
Tuples, on the other hand, are immutable and provide a __hash__ implementation:
>>> a_tuple = (1, 2, 3)
>>> a_tuple.__hash__ is None
False
>>> hash(a_tuple)
2528502973977326415
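Note that hashability is recursive: a tuple is hashable only if all of its elements are, so a tuple that still contains a list raises the same error:
>>> hash((1, [2, 3]))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'list'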
and can therefore be used as a key. Similarly, if you want to use unique values as a key, you should use a frozenset:
rdd.map(lambda (k, v): (frozenset(k), v)).groupByKey().collect()
instead of a set:
# This will fail with TypeError: unhashable type: 'set'
rdd.map(lambda (k, v): (set(k), v)).groupByKey().collect()