How to use "cube" only for specific fields on Spark dataframe?


Problem description

I'm using Spark 1.6.1, and I have a dataframe like this:

+-------------+-----------+-----------------+-------+-------+-------+----------+-------+-------+-------+-------+
|     scene_id|  action_id|       classifier|os_name|country|app_ver|   p0value|p1value|p2value|p3value|p4value|
+-------------+-----------+-----------------+-------+-------+-------+----------+-------+-------+-------+-------+
|    test_home|scene_enter|        test_home|android|     KR|  5.6.3|__OTHERS__|  false|   test|   test|   test|
......

And I want to get a dataframe like the following by using a cube operation (grouped by all fields, but with only the "os_name", "country", and "app_ver" fields cubed):

+-------------+-----------+-----------------+-------+-------+-------+----------+-------+-------+-------+-------+---+
|     scene_id|  action_id|       classifier|os_name|country|app_ver|   p0value|p1value|p2value|p3value|p4value|cnt|
+-------------+-----------+-----------------+-------+-------+-------+----------+-------+-------+-------+-------+---+
|    test_home|scene_enter|        test_home|android|     KR|  5.6.3|__OTHERS__|  false|   test|   test|   test|  9|
|    test_home|scene_enter|        test_home|   null|     KR|  5.6.3|__OTHERS__|  false|   test|   test|   test| 35|
|    test_home|scene_enter|        test_home|android|   null|  5.6.3|__OTHERS__|  false|   test|   test|   test| 98|
|    test_home|scene_enter|        test_home|android|     KR|   null|__OTHERS__|  false|   test|   test|   test|101|
|    test_home|scene_enter|        test_home|   null|   null|  5.6.3|__OTHERS__|  false|   test|   test|   test|301|
|    test_home|scene_enter|        test_home|   null|     KR|   null|__OTHERS__|  false|   test|   test|   test|225|
|    test_home|scene_enter|        test_home|android|   null|   null|__OTHERS__|  false|   test|   test|   test|312|
|    test_home|scene_enter|        test_home|   null|   null|   null|__OTHERS__|  false|   test|   test|   test|521|
......

I have tried the following, but it seems both slow and ugly:

// Cube over all eleven columns, then filter out the unwanted marginals
val cubed = df
  .cube($"scene_id", $"action_id", $"classifier", $"country", $"os_name", $"app_ver",
        $"p0value", $"p1value", $"p2value", $"p3value", $"p4value")
  .count
  .where("scene_id IS NOT NULL AND action_id IS NOT NULL AND classifier IS NOT NULL AND " +
         "p0value IS NOT NULL AND p1value IS NOT NULL AND p2value IS NOT NULL AND " +
         "p3value IS NOT NULL AND p4value IS NOT NULL")

Is there a better solution?

Answer

I believe you cannot avoid the problem completely, but there is a simple trick you can use to reduce its scale. The idea is to replace all the columns that shouldn't be marginalized with a single placeholder. Since cube produces a grouping set for every subset of its columns (2^n sets for n columns), collapsing the group-only columns into one struct shrinks the eleven-column cube above (2048 grouping sets) to a four-column one (16 sets).

For example, if you have a DataFrame:

val df = Seq((1, 2, 3, 4, 5, 6)).toDF("a", "b", "c", "d", "e", "f")

and you're interested in a cube marginalized by d and e and grouped by a..c, you can define the substitute for a..c as:

import org.apache.spark.sql.functions.struct
import sqlContext.implicits._  // sqlContext: the Spark 1.6 SQLContext (predefined in spark-shell)

// Pack the group-only columns into a single struct column
// (the alias here may not work in Spark 1.6)
val rest = struct(Seq($"a", $"b", $"c"): _*).alias("rest")

then define the columns to cube and run the cube itself:

val cubed =  Seq($"d", $"e")

// If there is a problem with aliasing rest it can done here.
val tmp = df.cube(rest.alias("rest") +: cubed: _*).count

A quick filter and select should handle the rest:

tmp.where($"rest".isNotNull).select($"rest.*" +: cubed :+ $"count": _*)

with a result like:

+---+---+---+----+----+-----+
|  a|  b|  c|   d|   e|count|
+---+---+---+----+----+-----+
|  1|  2|  3|null|   5|    1|
|  1|  2|  3|null|null|    1|
|  1|  2|  3|   4|   5|    1|
|  1|  2|  3|   4|null|    1|
+---+---+---+----+----+-----+
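
Applied to the question's schema, the same trick would look roughly like this. This is only a sketch (it assumes the df from the question and the same Spark 1.6 sqlContext implicits as above); the final alias to cnt just matches the desired output column:

import org.apache.spark.sql.functions.struct

// Group-only columns packed into a single placeholder struct
val rest = struct(
  $"scene_id", $"action_id", $"classifier",
  $"p0value", $"p1value", $"p2value", $"p3value", $"p4value"
).alias("rest")

// Only these three columns are marginalized (cubed)
val cubed = Seq($"os_name", $"country", $"app_ver")

val result = df
  .cube(rest +: cubed: _*)
  .count
  .where($"rest".isNotNull)                                // drop rows where the placeholder was marginalized
  .select($"rest.*" +: cubed :+ $"count".alias("cnt"): _*) // unpack the struct and rename count

Note the column order differs slightly from the desired table (the cubed columns come after the struct fields); a final select can reorder them if needed.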
