Nested forEach Loop over Two DataFrames


Problem Description

A nested foreach iteration over two DataFrames throws a NullPointerException:

def nestedDataFrame(leftDF: DataFrame, riteDF: DataFrame): Unit = {
  val leftCols: Array[String] = leftDF.columns
  val riteCols: Array[String] = riteDF.columns

  leftCols.foreach { ltColName =>
    leftDF.select(ltColName).foreach { ltRow =>
      val leftString = ltRow.apply(0).toString
      // works ... but the same kind of code below throws
      riteCols.foreach { rtColName =>
        riteDF.select(rtColName).foreach { rtRow => // Exception
          val riteString = rtRow.apply(0).toString
          print(leftString.equals(riteString))
        }
      }
    }
  }
}

Exception:

java.lang.NullPointerException
  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:77)
  at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:3406)
  at org.apache.spark.sql.Dataset.select(Dataset.scala:1334)
  at org.apache.spark.sql.Dataset.select(Dataset.scala:1352)

What could be going wrong and how to fix it?

Answer

leftDF.select(ltColName).foreach { ltRow =>

The line above ships the body of the foreach block to an executor as a task. With riteDF.select(rtColName).foreach { rtRow =>, you are then trying to use the SparkSession on an executor, which is not allowed: the SparkSession is available only on the driver. In the ofRows method, it tries to access sparkSession:

val qe = sparkSession.sessionState.executePlan(logicalPlan)
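
To see the constraint concretely: every Spark call that plans a query (select, join, and so on) needs the SparkSession and therefore must run on the driver. A minimal driver-side sketch of the same comparison, assuming the selected columns are small enough to collect() into driver memory (an illustration only; the join below is the better route):

// Hedged sketch: collect() pulls each column back to the driver, so all
// SparkSession-dependent calls stay on the driver; only a plain Scala loop
// over the collected arrays does the comparing. Viable only when the
// columns fit in driver memory.
leftCols.foreach { ltColName =>
  val leftValues = leftDF.select(ltColName).collect().map(_.get(0).toString)
  riteCols.foreach { rtColName =>
    val riteValues = riteDF.select(rtColName).collect().map(_.get(0).toString)
    for (l <- leftValues; r <- riteValues) print(l.equals(r))
  }
}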

You can't use Dataset collections like regular Java/Scala collections; instead, work with them through the APIs Spark provides for the task. For example, you can join them to correlate the data.

In this case, you can accomplish the comparison in a number of ways. You can join the two datasets, for example:

// col() comes from org.apache.spark.sql.functions and resolves the variables'
// values; $"ltColName" would instead reference a column literally named "ltColName".
val joinedDf = leftDF.select(ltColName).join(riteDF.select(rtColName), col(ltColName) === col(rtColName), "inner")
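
Putting that back into the per-column loop, a hedged sketch of the whole comparison (the lval/rval aliases and the count() call are illustrative assumptions, not part of the answer above):

import org.apache.spark.sql.functions.col

leftCols.foreach { ltColName =>
  riteCols.foreach { rtColName =>
    // Alias both sides so the join condition stays unambiguous even when
    // the two columns happen to share a name.
    val l = leftDF.select(col(ltColName).as("lval"))
    val r = riteDF.select(col(rtColName).as("rval"))
    val matches = l.join(r, col("lval") === col("rval"), "inner").count()
    println(s"$ltColName vs $rtColName: $matches matching rows")
  }
}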

Then analyze joinedDf as needed. You can even intersect() the two datasets:
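
A minimal intersect() sketch, assuming the two selected columns have compatible types (note that intersect also deduplicates rows):

// Distinct values present in both columns; planned on the driver and
// executed as ordinary Spark transformations.
val common = leftDF.select(col(ltColName).as("value"))
  .intersect(riteDF.select(col(rtColName).as("value")))
common.show()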

