pyspark and reduceByKey: how to make a simple sum


Question

I am trying some code in Spark (pyspark) for an assignment. This is my first time using this environment, so I'm surely missing something…

I have a simple dataset called c_views.

If I run c_views.collect()

I get
[…] (u'ABC', 100), (u'DEF', 200), (u'XXX', 50), (u'XXX', 70)] […]

What I need to achieve is the sum across all words. So my guess is that I should get something like:

(u'ABC', 100), (u'DEF', 200), (u'XXX', 120)

So what I am trying to do is (following the hints in the assignment):

first I define the function sum_views(a, b) for the input dataset, and then run a reduceByKey, i.e.

c_views.reduceByKey(sum_views).collect()

However, I do not understand what exactly I have to code in the function. I have tried many things but I always get an error. Does this workflow make sense? Are there other simple ways to achieve the result?

Any suggestions? Thank you in advance for your help.

Answer

Other simple ways to achieve the result?

from operator import add

c_views.reduceByKey(add)

or if you prefer lambda expressions:

c_views.reduceByKey(lambda x, y: x + y)
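
For example, here is a minimal end-to-end run with the sample data from the question (a sketch that assumes the standard pyspark shell, where a SparkContext is already available as sc):

from operator import add

# hypothetical recreation of c_views from the values shown above
c_views = sc.parallelize([(u'ABC', 100), (u'DEF', 200), (u'XXX', 50), (u'XXX', 70)])

c_views.reduceByKey(add).collect()
# [(u'ABC', 100), (u'DEF', 200), (u'XXX', 120)]  (order may vary)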

I do not understand what exactly I have to code in the function

It has to be a function which takes two values of the same type as the values in your RDD and returns a value of the same type as the inputs. It also has to be associative, which means that the final result cannot depend on how you arrange the parentheses.
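
So a sum_views that satisfies these requirements can be as simple as returning the sum of its two arguments (a sketch using the function name from the question):

def sum_views(a, b):
    # a and b are two view counts that were grouped under the same key;
    # reduceByKey applies this pairwise until one value per key remains
    return a + b

c_views.reduceByKey(sum_views).collect()

Addition is associative (and commutative), so the result does not depend on the order in which Spark combines the partial sums.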
