How to sum the values of one column of a dataframe in spark/scala
Question
I have a DataFrame that I read from a CSV file with many columns, e.g. timestamp, steps, heartrate, etc.
I want to sum the values of each column, for instance the total number of steps in the "steps" column.
As far as I can see, I want to use these kinds of functions: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$
But I cannot understand how to use the function sum.
When I write the following:
val df = CSV.load(args(0))
val sumSteps = df.sum("steps")
the function sum cannot be resolved.
Do I use the function sum wrongly? Do I need to use the function map first? And if yes, how?
A simple example would be very helpful! I started writing Scala recently.
Answer
If you want to sum all values of one column, it's more efficient to use the DataFrame's internal RDD and reduce.
import sqlContext.implicits._
import org.apache.spark.sql.functions._

// Toy DataFrame with a single "steps" column holding 10, 2, 3, 4.
val df = sc.parallelize(Array(10, 2, 3, 4)).toDF("steps")

// Extract the column as an RDD of Rows, unbox each value, and reduce to the sum.
df.select(col("steps")).rdd.map(_(0).asInstanceOf[Int]).reduce(_ + _)
// res1: Int = 19
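For reference, the sum aggregate from org.apache.spark.sql.functions that the question links to can also be used directly on the DataFrame itself. A minimal sketch, assuming the same sc/sqlContext setup as the snippet above (note that summing an Int column yields a Long in Spark SQL):

```scala
import sqlContext.implicits._
import org.apache.spark.sql.functions.sum

// Same toy data: a "steps" column holding 10, 2, 3, 4.
val df = sc.parallelize(Array(10, 2, 3, 4)).toDF("steps")

// agg(sum(...)) computes the aggregate within the DataFrame API,
// returning a one-row DataFrame; first.getLong(0) extracts the value.
val total = df.agg(sum("steps")).first.getLong(0)
// total: Long = 19
```

This keeps the computation inside the DataFrame layer rather than dropping down to the RDD API, which is the more idiomatic way to use the sum function the question was asking about.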