How to optimize this code on Spark?

Question

How can I make this code more efficient in Spark?
I need to calculate the minimum, maximum, count, and mean from my data.
Here is my sample data:

Name Shop Money
A Shop001 99.99
A Shop001 87.15
B Shop001 3.99
...

I am trying to organize my data to compute the mean, min, max, and count for each Name+Shop key,
and then fetch the result with collect().
Here is my code in Spark:

 

Could anyone suggest a more elegant coding style?
Thanks!

Recommended answer

You should use aggregateByKey for more efficient processing. The idea is to keep a state vector per key consisting of count, min, max, and sum, and to use two aggregation functions to update and combine that state; the mean is then the sum divided by the count. You can also use a tuple as the key, so there is no need to concatenate the key fields into a single string.

# Sample rows: [name, shop, money]
data = [
    ['x', 'shop1', 1],
    ['x', 'shop1', 2],
    ['x', 'shop2', 3],
    ['x', 'shop2', 4],
    ['x', 'shop3', 5],
    ['y', 'shop4', 6],
    ['y', 'shop4', 7],
    ['y', 'shop4', 8],
]

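# State layout: [count, min, max, sum]

def add(state, x):
    # Fold a single value into the running state within a partition.
    state[0] += 1
    state[1] = min(state[1], x)
    state[2] = max(state[2], x)
    state[3] += x
    return state

def merge(state1, state2):
    # Combine the partial states computed on two partitions.
    state1[0] += state2[0]
    state1[1] = min(state1[1], state2[1])
    state1[2] = max(state1[2], state2[2])
    state1[3] += state2[3]
    return state1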

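from pyspark import SparkContext

sc = SparkContext.getOrCreate()  # reuses the shell's context if one already exists

# Key each row by the (name, shop) tuple. The zero value uses infinite
# sentinels so that min and max are correct for any range of input values
# (the fixed bounds 10000 and 0 would break for larger or negative amounts).
res = (sc.parallelize(data)
       .map(lambda x: ((x[0], x[1]), x[2]))
       .aggregateByKey([0, float('inf'), float('-inf'), 0], add, merge))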

for x in res.collect():
    # x is ((name, shop), [count, min, max, sum]); the mean is sum / count.
    print('Client "%s" shop "%s" : count %d min %f max %f avg %f' % (
        x[0][0], x[0][1],
        x[1][0], x[1][1], x[1][2], float(x[1][3]) / float(x[1][0])
    ))
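
To see why the two functions are enough, here is a minimal sketch in plain Python (no Spark needed) of what aggregateByKey does for a single key. The split of values into partition_a and partition_b is a made-up example, and these add and merge variants return new lists rather than mutating the state, which is an assumption made to keep the simulation simple.

```python
from functools import reduce

def add(state, x):
    """Fold one value into a [count, min, max, sum] state (returns a new list)."""
    return [state[0] + 1, min(state[1], x), max(state[2], x), state[3] + x]

def merge(s1, s2):
    """Combine two partial states that were built on different partitions."""
    return [s1[0] + s2[0], min(s1[1], s2[1]), max(s1[2], s2[2]), s1[3] + s2[3]]

# Zero value: infinite sentinels keep min/max correct for any input range.
ZERO = [0, float('inf'), float('-inf'), 0]

# Hypothetical values for one key, spread over two partitions.
partition_a = [99.99, 87.15]
partition_b = [3.99]

# Each partition folds its own values, starting from the zero value...
partial_a = reduce(add, partition_a, ZERO)
partial_b = reduce(add, partition_b, ZERO)

# ...and the partial states are then merged into the final state.
combined = merge(partial_a, partial_b)

# A single pass over all the values should give the same statistics.
single_pass = reduce(add, partition_a + partition_b, ZERO)

count, lowest, highest, total = combined
mean = total / count
print(count, lowest, highest, mean)
```

Because only the small four-element state per key is exchanged between partitions, the individual values never have to be grouped together in one place.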
