Spark DataFrame: find and set the main root for child


Problem description

I have the following Apache Spark Dataframe:

Parent - Child
A1 - A10
A1 - A2
A2 - A3
A3 - A4
A5 - A7
A7 - A6
A8 - A9

This DataFrame displays a connection between parent and child. Logically it looks like this one:

The main goal is to set the main root for each child, which means we should end up with the following DataFrame:

Parent - Child
A1 - A10
A1 - A2
A1 - A3
A1 - A4
A5 - A7
A5 - A6
A8 - A9
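In other words, every child must be mapped to its ultimate ancestor (the node that never appears as a child). As a quick illustration of the target transformation, here is a minimal pure-Python sketch (not Spark code) that resolves each child by walking up a parent map built from the example edges:

```python
# Minimal pure-Python sketch of the target transformation (not Spark code):
# resolve each child to its ultimate root by walking up the parent map.
edges = [("A1", "A10"), ("A1", "A2"), ("A2", "A3"), ("A3", "A4"),
         ("A5", "A7"), ("A7", "A6"), ("A8", "A9")]

# Each child has exactly one parent, so a dict is enough.
parent_of = {child: parent for parent, child in edges}

def root_of(node: str) -> str:
    """Follow parent pointers until a node with no parent (a root) is reached."""
    while node in parent_of:
        node = parent_of[node]
    return node

resolved = sorted((root_of(child), child) for _, child in edges)
print(resolved)
# → [('A1', 'A10'), ('A1', 'A2'), ('A1', 'A3'), ('A1', 'A4'),
#    ('A5', 'A6'), ('A5', 'A7'), ('A8', 'A9')]
```

This pointer-chasing version is only for intuition; on a distributed DataFrame there is no cheap random access, which is why the Spark answer below works with iterative self-joins instead.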


  • Everything should be implemented using Apache Spark.
  • There is no limit on the number of nodes, i.e. the algorithm should work regardless of how many nodes there are.

Recommended answer

I believe you can achieve it with the approach below:

val input_rdd = spark.sparkContext.parallelize(List(("A1", "A10"), ("A1", "A2"), ("A2", "A3"), ("A3", "A4"), ("A5", "A7"), ("A7", "A6"), ("A8", "A9"), ("A4", "A11"), ("A11", "A12"), ("A6", "A13")))
val input_df = input_rdd.toDF("Parent", "Child")
input_df.createOrReplaceTempView("TABLE1")
input_df.show()

Input

+------+-----+
|Parent|Child|
+------+-----+
|    A1|  A10|
|    A1|   A2|
|    A2|   A3|
|    A3|   A4|
|    A5|   A7|
|    A7|   A6|
|    A8|   A9|
|    A4|  A11|
|   A11|  A12|
|    A6|  A13|
+------+-----+

import org.apache.spark.sql.DataFrame

// linkchild: one expansion step — self-join the edge set to link children one level closer to the root
def linkchild(df: DataFrame): DataFrame = {
  df.createOrReplaceTempView("TEMP")
  spark.sql("""select distinct a.parent, b.child from TEMP a
               inner join TEMP b on a.parent = b.parent or a.child = b.parent""")
}

// findroot: validate the current result; if any child is still unlinked from its root, recurse
def findroot(rdf: DataFrame): Unit = {
  val link_child_df = linkchild(rdf)
  link_child_df.createOrReplaceTempView("TEMP1")
  val cnt = spark.sql("""select * from TABLE1 where child not in (
                           select child from (
                             select distinct a.parent, b.child from TEMP1 a
                             inner join TEMP1 b on a.parent = b.parent or a.child = b.parent
                             where a.parent not in (select distinct child from TABLE1)))""").count()
  if (cnt == 0) {
    spark.sql("""select * from (
                   select distinct a.parent, b.child from TEMP1 a
                   inner join TEMP1 b on a.parent = b.parent or a.child = b.parent
                   where a.parent not in (select distinct child from TABLE1))
                 order by parent, child""").show
  } else {
    findroot(link_child_df)
  }
}

// Call findroot with input_df; it in turn calls linkchild until it reaches the target
findroot(input_df)
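The core idea of this answer is a fixpoint computation: the edge set is repeatedly joined with itself until every child is paired with a main root, at which point the validation count drops to zero. A minimal pure-Python sketch of that iterate-until-fixpoint idea (illustrative only, not Spark code; the edge set mirrors input_df):

```python
# Pure-Python sketch of the fixpoint behind linkchild/findroot (not Spark code):
# repeatedly self-join the edge set until it stops growing, then keep only
# pairs whose parent is a root (a node that never appears as a child).
edges = {("A1", "A10"), ("A1", "A2"), ("A2", "A3"), ("A3", "A4"),
         ("A5", "A7"), ("A7", "A6"), ("A8", "A9"),
         ("A4", "A11"), ("A11", "A12"), ("A6", "A13")}

children = {c for _, c in edges}
roots = {p for p, _ in edges} - children  # nodes that never appear as a child

current = set(edges)
while True:
    # one "linkchild" step: a join b on a.parent = b.parent or a.child = b.parent
    joined = {(pa, cb) for pa, ca in current for pb, cb in current
              if pa == pb or ca == pb}
    if joined == current:  # fixpoint reached: no new links can be derived
        break
    current = joined

result = sorted((p, c) for p, c in current if p in roots)
print(result)
```

Each pass links children one more step toward their roots, so the loop terminates once the set stops growing; the Spark version expresses the same check as the `cnt == 0` validation query.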

Output

+------+-----+
|parent|child|
+------+-----+
|    A1|  A10|
|    A1|  A11|
|    A1|  A12|
|    A1|   A2|
|    A1|   A3|
|    A1|   A4|
|    A5|  A13|
|    A5|   A6|
|    A5|   A7|
|    A8|   A9|
+------+-----+
