A list as a key for PySpark's reduceByKey
Question
I am attempting to call the reduceByKey function of pyspark on data of the format ([a,b,c], 1), ([a,b,c], 1), ([a,d,b,e], 1), ...
It seems pyspark will not accept an array as the key in a normal key/value reduction by simply applying .reduceByKey(add).
I have already tried first converting the array to a string with .map(lambda (x, y): (str(x), y)), but this does not work because post-processing the strings back into arrays is too slow.
Is there a way I can make pyspark use the array as a key, or use another function to quickly convert the strings back to arrays?
Here is the associated error:
File "/home/jan/Documents/spark-1.4.0/python/lib/pyspark.zip/pyspark/shuffle.py", line 268, in mergeValues
d[k] = comb(d[k], v) if k in d else creator(v)
TypeError: unhashable type: 'list'
Summary
Input: x = [([a,b,c], 1), ([a,b,c], 1), ([a,d,b,e], 1), ...]
Desired output: y = [([a,b,c], 2), ([a,d,b,e], 1), ...]
so that I can access a via y[0][0][0] and 2 via y[0][1].
Answer
Try this:
rdd.map(lambda (k, v): (tuple(k), v)).groupByKey()
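If what you need is the summed counts from the question rather than grouped values, the same tuple conversion works with reduceByKey as well; a minimal sketch (using a kv-style lambda, which unlike the tuple-unpacking form above is valid in both Python 2 and 3):
from operator import add

# convert each list key to a hashable tuple, then sum the counts
counts = rdd.map(lambda kv: (tuple(kv[0]), kv[1])).reduceByKey(add)
y = counts.collect()  # e.g. [(('a', 'b', 'c'), 2), (('a', 'd', 'b', 'e'), 1)]
# order is not guaranteed; a is reachable via y[0][0][0] and 2 via y[0][1],
# matching the summary above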
Since Python lists are mutable, they cannot be hashed (they don't provide a __hash__ method):
>>> a_list = [1, 2, 3]
>>> a_list.__hash__ is None
True
>>> hash(a_list)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'list'
Tuples, on the other hand, are immutable and provide a __hash__ implementation:
>>> a_tuple = (1, 2, 3)
>>> a_tuple.__hash__ is None
False
>>> hash(a_tuple)
2528502973977326415
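Note that hashability is recursive: a tuple is hashable only if all of its elements are, so a tuple that still contains a list raises the same error:
>>> hash((1, [2, 3]))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'list'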
and can therefore be used as a key. Similarly, if you want to use unique values as a key, you should use a frozenset:
rdd.map(lambda (k, v): (frozenset(k), v)).groupByKey().collect()
instead of a set:
# This will fail with TypeError: unhashable type: 'set'
rdd.map(lambda (k, v): (set(k), v)).groupByKey().collect()