Getting the Summary of Whole Dataset or Only Columns in Apache Spark Java

Question
For the dataset below, to get the total summary values per Col1, I did:
import org.apache.spark.sql.functions._
val totaldf = df.groupBy("Col1").agg(lit("Total").as("Col2"), sum("price").as("price"), sum("displayPrice").as("displayPrice"))
and then merged it with the original DataFrame:
df.union(totaldf).orderBy(col("Col1"), col("Col2").desc).show(false)
The original df:
+-----------+-------+-------+--------------+
| Col1      | Col2  | price | displayPrice |
+-----------+-------+-------+--------------+
| Category1 | item1 | 15    | 14           |
| Category1 | item2 | 11    | 10           |
| Category1 | item3 | 18    | 16           |
| Category2 | item1 | 16    | 15           |
| Category2 | item2 | 11    | 10           |
| Category2 | item3 | 19    | 17           |
+-----------+-------+-------+--------------+
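For reproducibility, here is a minimal sketch of how a DataFrame like this might be constructed; the column names come from the tables above, and it assumes a SparkSession named spark is in scope:

import spark.implicits._
// hypothetical sample data matching the df table above
val df = Seq(
  ("Category1", "item1", 15, 14),
  ("Category1", "item2", 11, 10),
  ("Category1", "item3", 18, 16),
  ("Category2", "item1", 16, 15),
  ("Category2", "item2", 11, 10),
  ("Category2", "item3", 19, 17)
).toDF("Col1", "Col2", "price", "displayPrice")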
After merging:
+-----------+-------+-------+--------------+
| Col1      | Col2  | price | displayPrice |
+-----------+-------+-------+--------------+
| Category1 | Total | 44    | 40           |
| Category1 | item1 | 15    | 14           |
| Category1 | item2 | 11    | 10           |
| Category1 | item3 | 18    | 16           |
| Category2 | Total | 46    | 42           |
| Category2 | item1 | 16    | 15           |
| Category2 | item2 | 11    | 10           |
| Category2 | item3 | 19    | 17           |
+-----------+-------+-------+--------------+
Now I want a summary of the whole dataset as below, which adds a grand-total row (Col1 = Total) on top of the data for all Col1 and Col2 values. Required output:
+-----------+-------+-------+--------------+
| Col1      | Col2  | price | displayPrice |
+-----------+-------+-------+--------------+
| Total     | Total | 90    | 82           |
| Category1 | Total | 44    | 40           |
| Category1 | item1 | 15    | 14           |
| Category1 | item2 | 11    | 10           |
| Category1 | item3 | 18    | 16           |
| Category2 | Total | 46    | 42           |
| Category2 | item1 | 16    | 15           |
| Category2 | item2 | 11    | 10           |
| Category2 | item3 | 19    | 17           |
+-----------+-------+-------+--------------+
How can I achieve the above result?
Answer

Create a third DataFrame from totaldf as:
val finalTotalDF = totaldf.select(lit("Total").as("Col1"), lit("Total").as("Col2"), sum("price").as("price"), sum("displayPrice").as("displayPrice"))
and then use it in the union as:
df.union(totaldf).union(finalTotalDF).orderBy(col("Col1"), col("Col2").desc).show(false)
You should then have your final required DataFrame.
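As an aside, on Spark 2.x or later the built-in rollup can produce the same hierarchy of subtotals plus the grand total in a single pass; rolled-up levels come back as null, so a coalesce relabels them. A sketch under those assumptions, using the same column names:

import org.apache.spark.sql.functions._
// rollup emits one row per (Col1, Col2), one subtotal row per Col1
// (Col2 = null) and one grand-total row (both columns null)
val rolledUp = df.rollup("Col1", "Col2")
  .agg(sum("price").as("price"), sum("displayPrice").as("displayPrice"))
  .select(
    coalesce(col("Col1"), lit("Total")).as("Col1"),
    coalesce(col("Col2"), lit("Total")).as("Col2"),
    col("price"),
    col("displayPrice"))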
Updated
If ordering matters to you, then you should change the uppercase T of Total in the Col2 column to lowercase, i.e. total. The reason: string ordering is case-sensitive (by Unicode code point), so in descending order the capitalized "Total" would sort below the lowercase item names, while "total" sorts above them. Do the following:
import org.apache.spark.sql.functions._
// per-Col1 subtotals, labelled lowercase "total" so they sort above the items in descending order
val totaldf = df.groupBy("Col1").agg(lit("total").as("Col2"), sum("price").as("price"), sum("displayPrice").as("displayPrice"))
// grand-total row across all of the data
val finalTotalDF = totaldf.select(lit("Total").as("Col1"), lit("total").as("Col2"), sum("price").as("price"), sum("displayPrice").as("displayPrice"))
df.union(totaldf).union(finalTotalDF).orderBy(col("Col1").desc, col("Col2").desc).show(false)
and you should get
+---------+-----+-----+------------+
|Col1     |Col2 |price|displayPrice|
+---------+-----+-----+------------+
|Total    |total|90   |82          |
|Category2|total|46   |42          |
|Category2|item3|19   |17          |
|Category2|item2|11   |10          |
|Category2|item1|16   |15          |
|Category1|total|44   |40          |
|Category1|item3|18   |16          |
|Category1|item2|11   |10          |
|Category1|item1|15   |14          |
+---------+-----+-----+------------+
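The string-ordering assumption is easy to check in plain Scala, independent of Spark:

// Unicode code points: 'T' (84) < 'i' (105) < 't' (116)
Seq("Total", "item3", "total").sorted.reverse
// => List(total, item3, Total)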
If the ordering really matters to you, as mentioned in the comment:

"I want the total Data as priority, so I want that to be at the Top, which is actually the requirement for me"
Then you can create another column for sorting, as follows:
import org.apache.spark.sql.functions._
// subtotal rows get sort = 1, the grand total sort = 0, detail rows sort = 2
val totaldf = df.groupBy("Col1").agg(lit("Total").as("Col2"), sum("price").as("price"), sum("displayPrice").as("displayPrice"), lit(1).as("sort"))
val finalTotalDF = totaldf.select(lit("Total").as("Col1"), lit("Total").as("Col2"), sum("price").as("price"), sum("displayPrice").as("displayPrice"), lit(0).as("sort"))
// order by the helper column first, then drop it before showing
finalTotalDF.union(totaldf).union(df.withColumn("sort", lit(2))).orderBy(col("sort"), col("Col1"), col("Col2")).drop("sort").show(false)
and you should get
+---------+-----+-----+------------+
|Col1     |Col2 |price|displayPrice|
+---------+-----+-----+------------+
|Total    |Total|90   |82          |
|Category1|Total|44   |40          |
|Category2|Total|46   |42          |
|Category1|item1|15   |14          |
|Category1|item2|11   |10          |
|Category1|item3|18   |16          |
|Category2|item1|16   |15          |
|Category2|item2|11   |10          |
|Category2|item3|19   |17          |
+---------+-----+-----+------------+
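One caveat: union resolves columns by position, not by name, so all the unioned DataFrames must share the same column order and types (including the helper sort column in the last example). On Spark 2.3 or later, unionByName is available if name-based resolution is preferred.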