pyspark and reduceByKey: how to make a simple sum
Question
I am trying some code in Spark (pyspark) for an assignment. This is the first time I use this environment, so I am surely missing something…
I have a simple dataset called c_views.
If I run

c_views.collect()

I get

[…]
 (u'ABC', 100),
 (u'DEF', 200),
 (u'XXX', 50),
 (u'XXX', 70)]
[…]
What I need to achieve is the sum of the values for each word. So my guess is that I should get something like:
(u'ABC', 100),
(u'DEF', 200),
(u'XXX', 120)
So what I am trying to do (following the hints in the assignment) is: first, define the function sum_views(a, b) for the input dataset, and then run a reduceByKey, i.e.
c_views.reduceByKey(sum_views).collect()
However, I do not understand exactly what I have to code in the function. I have tried many things but I always get an error. Does the workflow make sense? Are there other simple ways to achieve the result?

Any suggestion? Thank you in advance for your help.
Answer
Other simple ways to achieve the result?
from operator import add
c_views.reduceByKey(add)
or if you prefer lambda expressions:
c_views.reduceByKey(lambda x, y: x + y)
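If the assignment wants a named function, sum_views just has to add its two arguments. Here is a sketch of the function together with a pure-Python model of what reduceByKey computes per key (plain Python rather than Spark, so the sample pairs below are assumed to match the c_views data shown above):

```python
from functools import reduce

def sum_views(a, b):
    # Combine two view counts for the same key.
    return a + b

# Model of reduceByKey(sum_views): group the pairs by key,
# then fold each group's values with sum_views.
pairs = [('ABC', 100), ('DEF', 200), ('XXX', 50), ('XXX', 70)]

grouped = {}
for key, value in pairs:
    grouped.setdefault(key, []).append(value)

result = {key: reduce(sum_views, values) for key, values in grouped.items()}
print(sorted(result.items()))  # [('ABC', 100), ('DEF', 200), ('XXX', 120)]
```

In Spark the same per-key folding happens distributed across partitions, which is exactly why c_views.reduceByKey(sum_views).collect() yields the summed pairs.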
I do not understand what exactly I have to code in the function
It has to be a function which takes two values of the same type as the values in your RDD and returns a value of that same type. It also has to be associative, which means that the final result cannot depend on how you arrange the parentheses (in practice Spark also requires it to be commutative, since partial results from different partitions are merged in no fixed order).
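To see why associativity matters: Spark is free to combine the values in any grouping, so a non-associative function such as subtraction would give grouping-dependent (and therefore non-deterministic) results, while addition does not. A small illustration in plain Python:

```python
from operator import add, sub

# Addition: any way of parenthesizing gives the same result.
left = add(add(50, 70), 30)    # (50 + 70) + 30 = 150
right = add(50, add(70, 30))   # 50 + (70 + 30) = 150
print(left, right)  # 150 150

# Subtraction is not associative: regrouping changes the answer,
# so reduceByKey(sub) could return different results on each run.
left = sub(sub(50, 70), 30)    # (50 - 70) - 30 = -50
right = sub(50, sub(70, 30))   # 50 - (70 - 30) = 10
print(left, right)  # -50 10
```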