A list as a key for PySpark's reduceByKey


Problem Description


I am attempting to call the reduceByKey function of pyspark on data of the format [([a,b,c], 1), ([a,b,c], 1), ([a,d,b,e], 1), ...]

It seems pyspark will not accept an array as the key in a normal key-value reduction done by simply applying .reduceByKey(add).

I have already tried converting the array to a string first, via .map(lambda (x, y): (str(x), y)), but this does not work because parsing the strings back into arrays afterwards is too slow.

Is there a way I can make pyspark use the array as a key or use another function to quickly convert the strings back to arrays?
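For reference, the string round-trip the question describes might look like the following pure-Python sketch (the variable names are illustrative, not from the original code); the per-record parse step is what makes that approach slow:

```python
import ast

key = ["a", "b", "c"]

# Forward conversion: the list key becomes its string representation.
key_as_str = str(key)                  # "['a', 'b', 'c']"

# Post-processing: parsing the string back into a list for every record
# is the slow step the question complains about.
key_back = ast.literal_eval(key_as_str)

assert key_back == key
```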

Here is the associated error traceback:

  File "/home/jan/Documents/spark-1.4.0/python/lib/pyspark.zip/pyspark/shuffle.py", line 268, in mergeValues
    d[k] = comb(d[k], v) if k in d else creator(v)
TypeError: unhashable type: 'list'

SUMMARY:

input: x = [([a,b,c], 1), ([a,b,c], 1), ([a,d,b,e], 1), ...]

desired output: y = [([a,b,c], 2), ([a,d,b,e], 1), ...] such that I can access a via y[0][0][0] and 2 via y[0][1]
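Outside of Spark, the same hashable-key requirement applies to any dict-based aggregation. A minimal pure-Python sketch of the desired result (sample data assumed from the summary above, with strings standing in for a, b, c, ...), using tuple(key) to make the list keys hashable:

```python
from collections import Counter

# Sample input mirroring the summary's x.
x = [(["a", "b", "c"], 1), (["a", "b", "c"], 1), (["a", "d", "b", "e"], 1)]

counts = Counter()
for key, value in x:
    counts[tuple(key)] += value   # tuple(key) is hashable; the list itself is not

y = list(counts.items())
# y[0][0][0] is "a" and y[0][1] is 2, matching the desired access pattern
```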

Solution

Try this:

rdd.map(lambda (k, v): (tuple(k), v)).groupByKey()

Since Python lists are mutable, they cannot be hashed (they don't provide a __hash__ method):

>>> a_list = [1, 2, 3]
>>> a_list.__hash__ is None
True
>>> hash(a_list)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'list'

Tuples, on the other hand, are immutable and provide a __hash__ implementation:

>>> a_tuple = (1, 2, 3)
>>> a_tuple.__hash__ is None
False
>>> hash(a_tuple)
2528502973977326415

and hence can be used as keys. Similarly, if you want to use only the unique values as a key, you should use frozenset:

rdd.map(lambda (k, v): (frozenset(k), v)).groupByKey().collect()

instead of set:

# This will fail with TypeError: unhashable type: 'set'
rdd.map(lambda (k, v): (set(k), v)).groupByKey().collect()
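The difference can be checked locally without Spark. A small sketch of why frozenset works as a key where set does not (note that either one collapses duplicates and discards element order):

```python
fs = frozenset(["a", "b", "a"])        # immutable, hashable; duplicates collapsed
assert hash(fs) == hash(frozenset(["b", "a"]))   # equal contents, equal hash

try:
    hash(set(["a", "b"]))              # a plain set is mutable, hence unhashable
    raised = False
except TypeError:
    raised = True
assert raised
```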
