Two DataFrames in a nested foreach loop

Problem Description

A nested foreach iteration over two DataFrames throws a NullPointerException:

def nestedDataFrame(leftDF: DataFrame, riteDF: DataFrame): Unit = {
    val leftCols: Array[String] = leftDF.columns
    val riteCols: Array[String] = riteDF.columns

    leftCols.foreach { ltColName =>
        leftDF.select(ltColName).foreach { ltRow =>
            val leftString = ltRow.apply(0).toString
            // Works ... but the same kind of code below fails
            riteCols.foreach { rtColName =>
                riteDF.select(rtColName).foreach { rtRow => // Exception
                    val riteString = rtRow.apply(0).toString
                    print(leftString.equals(riteString))
                }
            }
        }
    }
}

Exception:

java.lang.NullPointerException
    at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:77)
    at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:3406)
    at org.apache.spark.sql.Dataset.select(Dataset.scala:1334)
    at org.apache.spark.sql.Dataset.select(Dataset.scala:1352)

What could be going wrong, and how can it be fixed?

Recommended Answer

leftDF.select(ltColName).foreach { ltRow =>

The line above ships the body of the foreach block to the executors as a task. With riteDF.select(rtColName).foreach { rtRow => you are then trying to use the Spark session inside an executor, which is not allowed: the Spark session is available only on the driver. In the ofRows method, it tries to access sparkSession:

val qe = sparkSession.sessionState.executePlan(logicalPlan)
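
As an aside, the original loop structure can only work if the inner data is first brought to the driver. A minimal sketch, assuming the columns are small enough to collect() into driver memory (this workaround is an illustration, not part of the original answer):

    // collect() materializes the rows on the driver, so the inner
    // loops never touch the Spark session from an executor.
    leftCols.foreach { ltColName =>
      val leftValues = leftDF.select(ltColName).collect().map(_.get(0).toString)
      riteCols.foreach { rtColName =>
        val riteValues = riteDF.select(rtColName).collect().map(_.get(0).toString)
        for (l <- leftValues; r <- riteValues) print(l == r)
      }
    }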

You can't use Dataset collections like regular Java/Scala collections; instead, use the APIs they provide to accomplish the task. For example, you can join them to correlate data.

In this case, you can accomplish the comparison in a number of ways. For example, you can join the two datasets:

val joinedDf = leftDF.select(ltColName).join(riteDF.select(rtColName), col(ltColName) === col(rtColName), "inner")
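
Putting it together, a minimal self-contained sketch of the join-based comparison (the compareColumns name, the lt_val/rt_val aliases, and the count/println reporting are illustrative assumptions, not the asker's exact requirements):

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.col

    // Driver-side rewrite: every select/join below is planned on the
    // driver; only the distributed work itself runs on executors.
    def compareColumns(leftDF: DataFrame, riteDF: DataFrame): Unit = {
      leftDF.columns.foreach { ltColName =>
        riteDF.columns.foreach { rtColName =>
          // Alias both sides so the join condition stays unambiguous
          // even when the two columns happen to share a name.
          val matches = leftDF.select(col(ltColName).as("lt_val"))
            .join(riteDF.select(col(rtColName).as("rt_val")),
                  col("lt_val") === col("rt_val"), "inner")
            .count()
          println(s"$ltColName vs $rtColName: $matches matching rows")
        }
      }
    }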

Then analyze joinedDf. You can even use intersect() on the two datasets, as sketched below.
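
A brief sketch of the intersect() variant (the aliasing to a shared "value" column is an assumption to keep the two schemas aligned):

    // intersect() resolves columns by position, so align both
    // single-column DataFrames and keep only the values present in each.
    val common = leftDF.select(col(ltColName).as("value"))
      .intersect(riteDF.select(col(rtColName).as("value")))
    common.show()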
