How to understand reduceByKey in Spark?
Problem Description
I am trying to learn Spark, and it has been going well so far, except for problems where I need to use functions like reduceByKey or combineByKey on a pair RDD whose values are lists.
I have been trying to find detailed documentation for these functions that explains what the arguments actually are, so that I could solve this myself without going to Stack Overflow, but I just cannot find any good documentation for Spark. I have read chapters 3 and 4 of Learning Spark, but to be honest, the explanations of the most complicated functions are very poor.
The problem I am dealing with right now is the following: I have a pair RDD where the key is a string and the value is a list of two elements, both integers, something like (country, [hour, count]). For each key, I wish to keep only the value with the highest count, regardless of the hour. Once I have the RDD in that format, I try to find the maximums by calling the following function in Spark:
reduceByKey(lambda x, y: max(x[1], y[1]))
But this throws the following error:
TypeError: 'int' object is not subscriptable
This does not make any sense to me. I interpreted the arguments x and y as the values of two entries for the same key, e.g. x = [13, 445] and y = [14, 109], but then the error makes no sense. What am I doing wrong?
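The failure can be reproduced without a cluster. Here is a minimal sketch that emulates how reduceByKey folds the values collected for one key, using plain functools.reduce (an assumption for illustration; Spark applies the same pairwise fold):

```python
from functools import reduce

# The values that reduceByKey would collect for a single key
values = [[13, 445], [14, 109], [15, 309]]

# The original lambda returns a bare int after the first step,
# so the next step's x[1] fails.
try:
    reduce(lambda x, y: max(x[1], y[1]), values)
except TypeError as e:
    print(e)  # 'int' object is not subscriptable
```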
Recommended Answer
Suppose you have [("key", [13, 445]), ("key", [14, 109]), ("key", [15, 309])].

When this is passed to reduceByKey, it will group all the values with the same key onto one executor, i.e. [13, 445], [14, 109], [15, 309], and then fold over those values pairwise.
In the first iteration, x is [13, 445] and y is [14, 109], and the output is max(x[1], y[1]), i.e. max(445, 109), which is 445.
In the second iteration, x is 445, i.e. the max from the previous step, and y is [15, 309].
Now, when the second element of x is requested via x[1] but 445 is just an integer, the error occurs:

TypeError: 'int' object is not subscriptable
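The two iterations described above can be traced step by step. This sketch (again emulating the fold with plain functools.reduce, an assumption for illustration) prints the type of x at each step, making it visible where the list accumulator silently becomes a bare int:

```python
from functools import reduce

values = [[13, 445], [14, 109], [15, 309]]

def step(x, y):
    # Show what the accumulator looks like on each iteration
    print(f"x = {x!r} ({type(x).__name__}), y = {y!r}")
    return max(x[1], y[1])

try:
    reduce(step, values)
except TypeError:
    print("x[1] failed: x is now an int, not a list")
```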
I hope the meaning of the error is now clear. You can find more details in my other answer.
The above explanation also shows why the solution proposed by @pault in the comments section works, i.e.
reduceByKey(lambda x, y: (x[0], max(x[1], y[1])))
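Emulated the same way in plain Python (an assumption for illustration, mirroring the per-key fold), the fixed lambda keeps the accumulator a pair, so x[1] is valid on every iteration:

```python
from functools import reduce

values = [[13, 445], [14, 109], [15, 309]]

# The accumulator is now always a 2-element pair, never a bare int.
result = reduce(lambda x, y: (x[0], max(x[1], y[1])), values)
print(result)  # (13, 445)
```

Note that this variant always carries along the first element's hour rather than the hour belonging to the winning count; since the question ignores the hour, that is acceptable here. If the matching hour were needed, a lambda like `lambda x, y: x if x[1] >= y[1] else y` would keep the whole winning pair instead.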