How to split a multi-value column into separate rows using a typed Dataset?


Problem description

I need to split a multi-value column, i.e. a List[String], into separate rows.

The initial dataset has the following type: Dataset[(Integer, String, Double, scala.List[String])]

+---+--------------------+-------+--------------------+
| id|       text         | value |    properties      |
+---+--------------------+-------+--------------------+
|  0|Lorem ipsum dolor...|    1.0|[prp1, prp2, prp3..]|
|  1|Lorem ipsum dolor...|    2.0|[prp4, prp5, prp6..]|
|  2|Lorem ipsum dolor...|    3.0|[prp7, prp8, prp9..]|
+---+--------------------+-------+--------------------+

The resulting dataset should have the following type:

Dataset[(Integer, String, Double, String)]

The properties column should be split like this:

+---+--------------------+-------+--------------------+
| id|       text         | value |    property        |
+---+--------------------+-------+--------------------+
|  0|Lorem ipsum dolor...|    1.0|        prp1        |
|  0|Lorem ipsum dolor...|    1.0|        prp2        |
|  0|Lorem ipsum dolor...|    1.0|        prp3        |
|  1|Lorem ipsum dolor...|    2.0|        prp4        |
|  1|Lorem ipsum dolor...|    2.0|        prp5        |
|  1|Lorem ipsum dolor...|    2.0|        prp6        |
+---+--------------------+-------+--------------------+

Recommended answer

explode is often suggested, but it comes from the untyped DataFrame API. Since you are using a Dataset, I think the flatMap operator might be a better fit (see org.apache.spark.sql.Dataset):

flatMap[U](func: (T) ⇒ TraversableOnce[U])(implicit arg0: Encoder[U]): Dataset[U]

(Scala-specific) Returns a new Dataset by first applying a function to all elements of this Dataset, and then flattening the results.
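To make the "apply, then flatten" behaviour concrete, here is a minimal sketch on plain Scala collections, with no Spark involved. The sample rows and the names rows and flat are illustrative assumptions, not part of the original question; the same shape of function is what Dataset.flatMap expects.

```scala
// Plain Scala collections only: each input row carries a list of properties.
val rows = List(
  (0, "a", 1.0, List("prp1", "prp2")),
  (1, "b", 2.0, List("prp3"))
)

// For every row, build one output tuple per property; flatMap then
// concatenates the per-row lists into a single flat list.
val flat = rows.flatMap { case (id, text, value, props) =>
  props.map(p => (id, text, value, p))
}
```

A row with two properties yields two output tuples, and a row with an empty list would yield none.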

You could use it as follows:

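// Needed for toDF/as and the tuple encoders (already in scope in spark-shell)
import spark.implicits._

val ds = Seq(
  (0, "Lorem ipsum dolor", 1.0, Array("prp1", "prp2", "prp3")))
  .toDF("id", "text", "value", "properties")
  .as[(Integer, String, Double, scala.List[String])]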

scala> ds.flatMap { t => 
  t._4.map { prp => 
    (t._1, t._2, t._3, prp) }}.show
+---+-----------------+---+----+
| _1|               _2| _3|  _4|
+---+-----------------+---+----+
|  0|Lorem ipsum dolor|1.0|prp1|
|  0|Lorem ipsum dolor|1.0|prp2|
|  0|Lorem ipsum dolor|1.0|prp3|
+---+-----------------+---+----+

// or just using for-comprehension
for {
  t <- ds
  prp <- t._4
} yield (t._1, t._2, t._3, prp)
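The tuple-based result above ends up with columns named _1 to _4. If named columns such as id and property are wanted, one option is to use case classes instead of tuples. The following is only a sketch under assumptions: the case classes Record and Flat, the object SplitProperties, and the local SparkSession settings are hypothetical names chosen for illustration, not part of the original answer.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical case classes (not from the original post); they are defined
// at the top level so that Spark can derive encoders for them.
final case class Record(id: Int, text: String, value: Double, properties: Seq[String])
final case class Flat(id: Int, text: String, value: Double, property: String)

object SplitProperties {
  // Pure function: one Flat per element of properties.
  def split(r: Record): Seq[Flat] =
    r.properties.map(p => Flat(r.id, r.text, r.value, p))

  def main(args: Array[String]): Unit = {
    // Assumed local session for the sketch; in spark-shell `spark` already exists.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("split-properties")
      .getOrCreate()
    import spark.implicits._

    val ds: Dataset[Record] = Seq(
      Record(0, "Lorem ipsum dolor", 1.0, Seq("prp1", "prp2", "prp3"))
    ).toDS()

    // Typed flatMap: the result is a Dataset[Flat], so the columns are
    // named after the case class fields (id, text, value, property).
    val flat: Dataset[Flat] = ds.flatMap(split _)
    flat.show()

    spark.stop()
  }
}
```

Keeping split as a separate function means the flattening logic can be checked without starting Spark.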
