How to sum the values of one column of a dataframe in spark/scala

Problem description

I have a DataFrame that I read from a CSV file with many columns, such as: timestamp, steps, heartrate, etc.

I want to sum the values of each column, for instance the total number of steps in the "steps" column.

As far as I can see, I want to use this kind of function: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$

But I can't understand how to use the function sum.

When I write the following:

val df = CSV.load(args(0))
val sumSteps = df.sum("steps")  // does not compile: sum is not a member of DataFrame

the function sum cannot be resolved.

Do I use the function sum wrongly? Do I need to use the function map first? And if yes, how?

A simple example would be very helpful! I started writing Scala recently.

Recommended answer

If you want to sum all values of one column, it's more efficient to use the DataFrame's internal RDD and reduce.

import sqlContext.implicits._                // for toDF
import org.apache.spark.sql.functions._      // for col

// Build a one-column DataFrame of Ints.
val df = sc.parallelize(Array(10, 2, 3, 4)).toDF("steps")

// Drop to the underlying RDD, pull the Int out of each Row, and reduce.
df.select(col("steps")).rdd.map(_(0).asInstanceOf[Int]).reduce(_ + _)
// res1: Int = 19
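
For comparison, here is a minimal sketch of the DataFrame-native alternative, assuming the same df as above: aggregate the column with org.apache.spark.sql.functions.sum via agg. Note that Spark widens the sum of an integer column to a Long, hence getLong.

import org.apache.spark.sql.functions.sum

// Sketch: aggregate the "steps" column directly. The result comes back
// as a single-row DataFrame whose only value is the (Long) sum.
val total = df.agg(sum("steps")).first.getLong(0)
// total: Long = 19

This keeps the computation inside the DataFrame API, so the cast out of each Row is not needed.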
