Generate ID in Spark as per the below logic in Spark Scala
Question
I have a dataframe with parent_id, service_id, product_relation_id, and product_name fields, as given below. I want to assign an id field as shown in the table below. Please note that:
one parent_id has many service_ids
one service_id has many product_names
ID generation should follow the pattern below:
Parent  -- 1.n
Child 1 -- 1.n.1
Child 2 -- 1.n.2
Child 3 -- 1.n.3
Child 4 -- 1.n.4
How do we implement this logic in a way that also performs well on big data?
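Judging from the expected output further down, the input product.csv presumably looks like this (reconstructed from the output table, not taken from the original post):

```
parent_id,service_id,product_relation_id,product_name
100,1,1-A,Parent
100,1,1-A,Child1
100,1,1-A,Child2
100,1,1-A,Child3
100,1,1-A,Child4
100,2,1-B,Parent
100,2,1-B,Child1
100,2,1-B,Child2
100,2,1-B,Child3
100,2,1-B,Child4
100,3,1-C,Parent
100,3,1-C,Child1
100,3,1-C,Child2
100,3,1-C,Child3
100,3,1-C,Child4
200,5,1-D,Parent
200,5,1-D,Child1
200,5,1-D,Child2
200,5,1-D,Child3
200,5,1-D,Child4
```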
Answer
Scala implementation:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// dense_rank over all parent_ids gives each parent a sequential version number
val parentWindowSpec = Window.orderBy("parent_id")

// row_number within each (parent_version, service_id) group orders the children
val childWindowSpec = Window
  .partitionBy("parent_version", "service_id")
  .orderBy("product_relation_id")

val df = spark.read.options(
  Map("inferSchema" -> "true", "delimiter" -> ",", "header" -> "true")
).csv("product.csv")

val df2 = df
  .withColumn("parent_version", dense_rank.over(parentWindowSpec))
  // subtract 1 so that the "Parent" row in each group gets child_version 0
  .withColumn("child_version", row_number.over(childWindowSpec) - 1)

val df3 = df2.withColumn("id",
  when(col("product_name") === lit("Parent"),
    concat(lit("1."), col("parent_version")))
    .otherwise(concat(lit("1."), col("parent_version"), lit("."), col("child_version")))
).drop("parent_version", "child_version")
Output:
scala> df3.show
21/03/26 11:55:01 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
+---------+----------+-------------------+------------+-----+
|parent_id|service_id|product_relation_id|product_name| id|
+---------+----------+-------------------+------------+-----+
| 100| 1| 1-A| Parent| 1.1|
| 100| 1| 1-A| Child1|1.1.1|
| 100| 1| 1-A| Child2|1.1.2|
| 100| 1| 1-A| Child3|1.1.3|
| 100| 1| 1-A| Child4|1.1.4|
| 100| 2| 1-B| Parent| 1.1|
| 100| 2| 1-B| Child1|1.1.1|
| 100| 2| 1-B| Child2|1.1.2|
| 100| 2| 1-B| Child3|1.1.3|
| 100| 2| 1-B| Child4|1.1.4|
| 100| 3| 1-C| Parent| 1.1|
| 100| 3| 1-C| Child1|1.1.1|
| 100| 3| 1-C| Child2|1.1.2|
| 100| 3| 1-C| Child3|1.1.3|
| 100| 3| 1-C| Child4|1.1.4|
| 200| 5| 1-D| Parent| 1.2|
| 200| 5| 1-D| Child1|1.2.1|
| 200| 5| 1-D| Child2|1.2.2|
| 200| 5| 1-D| Child3|1.2.3|
| 200| 5| 1-D| Child4|1.2.4|
+---------+----------+-------------------+------------+-----+
only showing top 20 rows
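For intuition, the window logic above can be mirrored in plain Scala on a small in-memory sample (a sketch, not the Spark code itself): dense_rank over parent_id corresponds to the 1-based position of a parent in the sorted distinct parent list, and row_number within each (parent_id, service_id) group, minus one, gives the child ordinal, with 0 landing on the "Parent" row. This assumes the Parent row comes first within each group, as in the data shown; the Row case class and assignIds helper are illustrative names, not part of the original answer.

```scala
// Each input row of the dataframe, as a plain case class.
case class Row(parentId: Int, serviceId: Int, relId: String, name: String)

def assignIds(rows: Seq[Row]): Seq[(Row, String)] = {
  // dense_rank over parent_id: 1-based index in the sorted distinct parent list.
  val parentRank: Map[Int, Int] =
    rows.map(_.parentId).distinct.sorted.zipWithIndex
      .map { case (p, i) => p -> (i + 1) }.toMap

  // row_number within (parent_id, service_id), minus one, is the child ordinal.
  rows.groupBy(r => (r.parentId, r.serviceId)).toSeq.flatMap { case (_, grp) =>
    grp.zipWithIndex.map { case (r, childVersion) =>
      val pv = parentRank(r.parentId)
      val id =
        if (r.name == "Parent") s"1.$pv"
        else s"1.$pv.$childVersion"
      r -> id
    }
  }
}
```

Note the WARN in the output above: Window.orderBy without partitionBy moves all rows into a single partition. A common mitigation on large data is to dense-rank only the distinct parent_ids (a much smaller dataset) and broadcast-join the resulting rank table back onto the main dataframe, so the unpartitioned window only ever sees the distinct keys.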