How to sum the values of one column of a dataframe in spark/scala

Problem description

I have a DataFrame that I read from a CSV file with many columns, such as: timestamp, steps, heartrate, etc.

I want to sum the values of each column, for instance the total number of steps in the "steps" column.

As far as I can see, I want to use this kind of function: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$

But I can't understand how to use the function sum.

When I write the following:

val df = CSV.load(args(0))
val sumSteps = df.sum("steps")  // does not compile: sum is not a member of DataFrame

the function sum cannot be resolved.

Do I use the function sum wrongly? Do I need to use the function map first? And if yes, how?

A simple example would be very helpful! I started writing Scala recently.

Recommended answer

If you want to sum all values of one column, it's more efficient to use the DataFrame's internal RDD and reduce.

import sqlContext.implicits._                // for toDF
import org.apache.spark.sql.functions._      // for col

// Build a one-column DataFrame of Ints.
val df = sc.parallelize(Array(10, 2, 3, 4)).toDF("steps")

// Drop to the underlying RDD, pull the Int out of each Row, and reduce.
df.select(col("steps")).rdd.map(_(0).asInstanceOf[Int]).reduce(_ + _)
// res1: Int = 19
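
For comparison, here is a minimal sketch of the DataFrame-native alternative, assuming the same df as above: aggregate the column with org.apache.spark.sql.functions.sum via agg. Note that Spark widens the sum of an integer column to a Long, hence getLong.

import org.apache.spark.sql.functions.sum

// Sketch: aggregate the "steps" column directly. The result comes back
// as a single-row DataFrame whose only value is the (Long) sum.
val total = df.agg(sum("steps")).first.getLong(0)
// total: Long = 19

This keeps the computation inside the DataFrame API, so the cast out of each Row is not needed.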
