How to split a multi-value column into separate rows using a typed Dataset?
Question
I am facing an issue: how do I split a multi-value column, i.e. List[String], into separate rows?
The initial dataset has the following type: Dataset[(Integer, String, Double, scala.List[String])]
+---+--------------------+-------+--------------------+
| id|                text|  value|          properties|
+---+--------------------+-------+--------------------+
|  0|Lorem ipsum dolor...|    1.0|[prp1, prp2, prp3..]|
|  1|Lorem ipsum dolor...|    2.0|[prp4, prp5, prp6..]|
|  2|Lorem ipsum dolor...|    3.0|[prp7, prp8, prp9..]|
+---+--------------------+-------+--------------------+
The resulting dataset should have the following type: Dataset[(Integer, String, Double, String)], with properties split like this:
+---+--------------------+-------+--------------------+
| id|                text|  value|            property|
+---+--------------------+-------+--------------------+
|  0|Lorem ipsum dolor...|    1.0|                prp1|
|  0|Lorem ipsum dolor...|    1.0|                prp2|
|  0|Lorem ipsum dolor...|    1.0|                prp3|
|  1|Lorem ipsum dolor...|    2.0|                prp4|
|  1|Lorem ipsum dolor...|    2.0|                prp5|
|  1|Lorem ipsum dolor...|    2.0|                prp6|
+---+--------------------+-------+--------------------+
Recommended answer
explode is often suggested, but it comes from the untyped DataFrame API. Given that you use Dataset, I think the flatMap operator might be a better fit (see org.apache.spark.sql.Dataset).
flatMap[U](func: (T) ⇒ TraversableOnce[U])(implicit arg0: Encoder[U]): Dataset[U]
(Scala-specific) Returns a new Dataset by first applying a function to all elements of this Dataset, and then flattening the results.
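The "apply a function, then flatten" behaviour described above can be tried without Spark, since ordinary Scala collections offer a flatMap of the same shape. The following is a minimal sketch meant for pasting into the Scala REPL; the names Row, FlatRow and splitProperties are made up for illustration and are not part of any Spark API.

```scala
// Plain Scala collections only; no Spark required.
// Row, FlatRow and splitProperties are hypothetical names used for illustration.
type Row     = (Int, String, Double, List[String])
type FlatRow = (Int, String, Double, String)

// For every input row, emit one output row per element of its property list,
// then flatten all of those per-row lists into a single list.
def splitProperties(rows: List[Row]): List[FlatRow] =
  rows.flatMap { case (id, text, value, properties) =>
    properties.map(property => (id, text, value, property))
  }

val rows: List[Row] = List(
  (0, "Lorem ipsum dolor", 1.0, List("prp1", "prp2", "prp3")),
  (1, "Lorem ipsum dolor", 2.0, List("prp4", "prp5", "prp6"))
)

val flat: List[FlatRow] = splitProperties(rows)
```

Note that a row whose property list is empty contributes nothing to the result, because there is nothing to flatten for it. If such rows need to survive, they have to be handled explicitly.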
You could use it as follows:
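// Needed for toDF and the tuple encoders (already in scope in spark-shell)
import spark.implicits._

val ds = Seq(
  (0, "Lorem ipsum dolor", 1.0, Array("prp1", "prp2", "prp3")))
  .toDF("id", "text", "value", "properties")
  .as[(Integer, String, Double, scala.List[String])]

// Emit one output tuple per element of the list in the fourth field
ds.flatMap { t =>
  t._4.map { prp =>
    (t._1, t._2, t._3, prp) }}.show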
+---+-----------------+---+----+
| _1|               _2| _3|  _4|
+---+-----------------+---+----+
|  0|Lorem ipsum dolor|1.0|prp1|
|  0|Lorem ipsum dolor|1.0|prp2|
|  0|Lorem ipsum dolor|1.0|prp3|
+---+-----------------+---+----+
// or, equivalently, using a for-comprehension
for {
  t <- ds
  prp <- t._4
} yield (t._1, t._2, t._3, prp)
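With tuples, the flattened result gets the generic column names _1 to _4. If named columns such as id and property are wanted while staying in the typed API, one option is to model the rows as case classes. This is only a sketch under assumptions: Record, FlatRecord, SplitProperties and split are hypothetical names, the local SparkSession is for experimentation, and Spark must be on the classpath.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical record types; the names are not from the question.
// They are declared at the top level so that Spark can derive encoders for them.
final case class Record(id: Int, text: String, value: Double, properties: List[String])
final case class FlatRecord(id: Int, text: String, value: Double, property: String)

object SplitProperties {
  // One FlatRecord per property; a record with no properties yields nothing.
  def split(r: Record): List[FlatRecord] =
    r.properties.map(p => FlatRecord(r.id, r.text, r.value, p))

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, for experimentation only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("split-properties")
      .getOrCreate()
    import spark.implicits._

    val ds: Dataset[Record] =
      Seq(Record(0, "Lorem ipsum dolor", 1.0, List("prp1", "prp2", "prp3"))).toDS()

    // Same flatMap as with tuples, but the result columns take the field names
    // of FlatRecord instead of _1 to _4.
    val flat: Dataset[FlatRecord] = ds.flatMap(r => split(r))
    flat.show()

    spark.stop()
  }
}
```

Keeping the per-record logic in a plain function like split makes it possible to check it without starting a Spark session.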