Summing n columns in Spark in Java using dataframes

Views: 60 Posted: 2020/9/4 21:06:44 Tags: java apache-spark apache-spark-sql spark-dataframe

Question

String[] col = {"a", "b", "c"};

Data:

id a b c d e
101 1 1 1 1 1
102 2 2 2 2 2
103 3 3 3 3 3

Expected output: the id together with the sum of the columns named in the col array

id (a+b+c)
101 3
102 6
103 9

How can I do this using dataframes?

Recommended answer

If you are using Java, you can do the following:

import org.apache.spark.SparkConf;
import org.apache.spark.SparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.api.java.UDF3;
import org.apache.spark.sql.types.DataTypes;

public class SumColumns {

    static SparkConf conf = new SparkConf().setMaster("local").setAppName("simple");
    static SparkContext sc = new SparkContext(conf);
    static SQLContext sqlContext = new SQLContext(sc);

    public static void main(String[] args) {

        // Read the space-delimited input; column names come from the header row
        Dataset<Row> df = sqlContext.read()
                .format("com.databricks.spark.csv")
                .option("delimiter", " ")
                .option("header", true)
                .option("inferSchema", true)
                .load("path to the input text file");

        // Register a three-argument UDF that adds its inputs.
        // The explicit UDF3 cast keeps the lambda's target type unambiguous.
        sqlContext.udf().register("sums",
                (UDF3<Integer, Integer, Integer, Integer>) (a, b, c) -> a + b + c,
                DataTypes.IntegerType);

        // createOrReplaceTempView is the non-deprecated replacement for registerTempTable (Spark 2.0+)
        df.createOrReplaceTempView("temp");
        sqlContext.sql("SELECT id, sums(a, b, c) AS `(a+b+c)` FROM temp").show(false);
    }
}

You should see the following output:

+---+-------+
|id |(a+b+c)|
+---+-------+
|101|3      |
|102|6      |
|103|9      |
+---+-------+
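The query above hard-codes three columns, while the question asks for the columns named in the col array. As a minimal sketch (not part of the original answer), the query text could be assembled from the array in plain Java and then passed to sqlContext.sql(...). The class name SumQueryBuilder and the method buildSumQuery are hypothetical, and the sketch swaps the UDF for the built-in + operator so that any number of columns can be joined.

```java
import java.util.Arrays;
import java.util.stream.Collectors;

public class SumQueryBuilder {

    // Hypothetical helper: builds a SELECT that adds up the given columns.
    public static String buildSumQuery(String idColumn, String[] cols, String view) {
        if (cols.length == 0) {
            throw new IllegalArgumentException("at least one column is required");
        }
        // Backtick-quote each name so it is read as an identifier: `a` + `b` + `c`
        String sumExpr = Arrays.stream(cols)
                .map(c -> "`" + c + "`")
                .collect(Collectors.joining(" + "));
        // The alias mirrors the header shown in the output above: (a+b+c)
        String alias = "(" + String.join("+", cols) + ")";
        return "SELECT " + idColumn + ", " + sumExpr + " AS `" + alias + "` FROM " + view;
    }

    public static void main(String[] args) {
        String[] col = {"a", "b", "c"};
        // prints: SELECT id, `a` + `b` + `c` AS `(a+b+c)` FROM temp
        System.out.println(buildSumQuery("id", col, "temp"));
    }
}
```

Note that with the built-in + operator a null in any of the summed columns makes the whole sum null for that row.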

If you prefer to skip the SQL query and use the API instead, you can do the following:

import org.apache.spark.sql.api.java.UDF3;
import org.apache.spark.sql.expressions.UserDefinedFunction;
import org.apache.spark.sql.types.DataTypes;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.udf;

    // Wrap the addition in a UDF and apply it to the three columns
    UserDefinedFunction sumUdf = udf(
            (UDF3<Integer, Integer, Integer, Integer>) (a, b, c) -> a + b + c,
            DataTypes.IntegerType);
    df.select(col("id"), sumUdf.apply(col("a"), col("b"), col("c")).as("(a+b+c)")).show(false);
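The UDF above takes exactly three arguments, so it does not extend to an arbitrary col array. One way to handle n columns with the API is to fold the names into a single expression. The sketch below is an illustration rather than the original answer's method: the class ColumnFold and the method sumOf are hypothetical, and the column type is left generic so the fold can run without Spark on the classpath. In a Spark job the two function arguments would presumably be functions::col and Column::plus.

```java
import java.util.Arrays;
import java.util.function.BinaryOperator;
import java.util.function.Function;

public class ColumnFold {

    // Maps each name to a column-like value and combines the values left to right.
    // T stands in for org.apache.spark.sql.Column so the sketch runs without Spark.
    public static <T> T sumOf(String[] names, Function<String, T> toColumn, BinaryOperator<T> plus) {
        return Arrays.stream(names)
                .map(toColumn)
                .reduce(plus)
                .orElseThrow(() -> new IllegalArgumentException("at least one column is required"));
    }

    public static void main(String[] args) {
        String[] col = {"a", "b", "c"};

        // Stand-in: render the expression as text to show the shape of the fold.
        String shape = sumOf(col, name -> name, (l, r) -> "(" + l + " + " + r + ")");
        System.out.println(shape); // prints ((a + b) + c)

        // Assumed Spark usage, with df being the Dataset<Row> read earlier:
        //   Column total = ColumnFold.sumOf(col, functions::col, Column::plus);
        //   df.select(functions.col("id"), total.as("(a+b+c)")).show(false);
    }
}
```

Because the fold produces an ordinary column expression, no UDF has to be registered and the number of columns is not fixed at compile time.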
