Getting the Summary of Whole Dataset or Only Columns in Apache Spark Java

Question
For the dataset below, to get the total summary values per Col1, I did:
import org.apache.spark.sql.functions._
val totaldf = df.groupBy("Col1").agg(lit("Total").as("Col2"), sum("price").as("price"), sum("displayPrice").as("displayPrice"))
and then merged it with the original DataFrame:
df.union(totaldf).orderBy(col("Col1"), col("Col2").desc).show(false)
The original df:
+-----------+-------+-------+--------------+
| Col1      | Col2  | price | displayPrice |
+-----------+-------+-------+--------------+
| Category1 | item1 | 15    | 14           |
| Category1 | item2 | 11    | 10           |
| Category1 | item3 | 18    | 16           |
| Category2 | item1 | 16    | 15           |
| Category2 | item2 | 11    | 10           |
| Category2 | item3 | 19    | 17           |
+-----------+-------+-------+--------------+
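For reproducibility, here is a minimal sketch of how a DataFrame like this might be constructed; the column names come from the tables above, and it assumes a SparkSession named spark is in scope:

import spark.implicits._
// hypothetical sample data matching the df table above
val df = Seq(
  ("Category1", "item1", 15, 14),
  ("Category1", "item2", 11, 10),
  ("Category1", "item3", 18, 16),
  ("Category2", "item1", 16, 15),
  ("Category2", "item2", 11, 10),
  ("Category2", "item3", 19, 17)
).toDF("Col1", "Col2", "price", "displayPrice")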
After merging:
+-----------+-------+-------+--------------+
| Col1      | Col2  | price | displayPrice |
+-----------+-------+-------+--------------+
| Category1 | Total | 44    | 40           |
| Category1 | item1 | 15    | 14           |
| Category1 | item2 | 11    | 10           |
| Category1 | item3 | 18    | 16           |
| Category2 | Total | 46    | 42           |
| Category2 | item1 | 16    | 15           |
| Category2 | item2 | 11    | 10           |
| Category2 | item3 | 19    | 17           |
+-----------+-------+-------+--------------+
Now I want a summary of the whole dataset as below, which adds a grand-total row (Col1 = Total) on top of the data for all Col1 and Col2 values. Required output:
+-----------+-------+-------+--------------+
| Col1      | Col2  | price | displayPrice |
+-----------+-------+-------+--------------+
| Total     | Total | 90    | 82           |
| Category1 | Total | 44    | 40           |
| Category1 | item1 | 15    | 14           |
| Category1 | item2 | 11    | 10           |
| Category1 | item3 | 18    | 16           |
| Category2 | Total | 46    | 42           |
| Category2 | item1 | 16    | 15           |
| Category2 | item2 | 11    | 10           |
| Category2 | item3 | 19    | 17           |
+-----------+-------+-------+--------------+
How can I achieve the above result?
Answer

Create a third DataFrame from totaldf as:
val finalTotalDF = totaldf.select(lit("Total").as("Col1"), lit("Total").as("Col2"), sum("price").as("price"), sum("displayPrice").as("displayPrice"))
and then use it in the union as:
df.union(totaldf).union(finalTotalDF).orderBy(col("Col1"), col("Col2").desc).show(false)
You should then have your final required DataFrame.
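As an aside, on Spark 2.x or later the built-in rollup can produce the same hierarchy of subtotals plus the grand total in a single pass; rolled-up levels come back as null, so a coalesce relabels them. A sketch under those assumptions, using the same column names:

import org.apache.spark.sql.functions._
// rollup emits one row per (Col1, Col2), one subtotal row per Col1
// (Col2 = null) and one grand-total row (both columns null)
val rolledUp = df.rollup("Col1", "Col2")
  .agg(sum("price").as("price"), sum("displayPrice").as("displayPrice"))
  .select(
    coalesce(col("Col1"), lit("Total")).as("Col1"),
    coalesce(col("Col2"), lit("Total")).as("Col2"),
    col("price"),
    col("displayPrice"))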
Updated
If ordering matters to you, then you should change the uppercase T of Total in the Col2 column to lowercase, i.e. total. The reason: string ordering is case-sensitive (by Unicode code point), so in descending order the capitalized "Total" would sort below the lowercase item names, while "total" sorts above them. Do the following:
import org.apache.spark.sql.functions._
// per-Col1 subtotals, labelled lowercase "total" so they sort above the items in descending order
val totaldf = df.groupBy("Col1").agg(lit("total").as("Col2"), sum("price").as("price"), sum("displayPrice").as("displayPrice"))
// grand-total row across all of the data
val finalTotalDF = totaldf.select(lit("Total").as("Col1"), lit("total").as("Col2"), sum("price").as("price"), sum("displayPrice").as("displayPrice"))
df.union(totaldf).union(finalTotalDF).orderBy(col("Col1").desc, col("Col2").desc).show(false)
and you should get
+---------+-----+-----+------------+
|Col1     |Col2 |price|displayPrice|
+---------+-----+-----+------------+
|Total    |total|90   |82          |
|Category2|total|46   |42          |
|Category2|item3|19   |17          |
|Category2|item2|11   |10          |
|Category2|item1|16   |15          |
|Category1|total|44   |40          |
|Category1|item3|18   |16          |
|Category1|item2|11   |10          |
|Category1|item1|15   |14          |
+---------+-----+-----+------------+
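The string-ordering assumption is easy to check in plain Scala, independent of Spark:

// Unicode code points: 'T' (84) < 'i' (105) < 't' (116)
Seq("Total", "item3", "total").sorted.reverse
// => List(total, item3, Total)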
If the ordering really matters to you, as mentioned in the comment:

"I want the total Data as priority, so I want that to be at the Top, which is actually the requirement for me"
Then you can create another column for sorting, as follows:
import org.apache.spark.sql.functions._
// subtotal rows get sort = 1, the grand total sort = 0, detail rows sort = 2
val totaldf = df.groupBy("Col1").agg(lit("Total").as("Col2"), sum("price").as("price"), sum("displayPrice").as("displayPrice"), lit(1).as("sort"))
val finalTotalDF = totaldf.select(lit("Total").as("Col1"), lit("Total").as("Col2"), sum("price").as("price"), sum("displayPrice").as("displayPrice"), lit(0).as("sort"))
// order by the helper column first, then drop it before showing
finalTotalDF.union(totaldf).union(df.withColumn("sort", lit(2))).orderBy(col("sort"), col("Col1"), col("Col2")).drop("sort").show(false)
and you should get
+---------+-----+-----+------------+
|Col1     |Col2 |price|displayPrice|
+---------+-----+-----+------------+
|Total    |Total|90   |82          |
|Category1|Total|44   |40          |
|Category2|Total|46   |42          |
|Category1|item1|15   |14          |
|Category1|item2|11   |10          |
|Category1|item3|18   |16          |
|Category2|item1|16   |15          |
|Category2|item2|11   |10          |
|Category2|item3|19   |17          |
+---------+-----+-----+------------+
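One caveat: union resolves columns by position, not by name, so all the unioned DataFrames must share the same column order and types (including the helper sort column in the last example). On Spark 2.3 or later, unionByName is available if name-based resolution is preferred.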