Where to find data inside an RDD in an Eclipse Spark Scala debug session?


Question

I am trying to debug a very simple Spark Scala word count program. Since Spark is "lazy", I think I need to put the breakpoint at an "action" statement and run that line of code; after that I should be able to check the RDD variables defined before that statement and look at their data. So I put a breakpoint at line 14, and when the debugger got there I hit Step Over to run line 14. However, after doing that I cannot see or find any data for the variables text1 and text2 in the debug session's Variables view (although I can see data inside the "all" variable there). Am I doing this right? Why can't I see data in the text1/text2 variables?

Suppose my wordCount.txt looks like this:


This is a text file with words aa aa bb cc cc

I expect to see (aa,2), (bb,1), (cc,2) etc. somewhere in the Variables view for text2, but I cannot find anything like that there. See the screenshot below the code.

I am using Eclipse Neon and Spark 2.1, and this is a local Eclipse debug session. Any help would be much appreciated, as I could not find any information after an extensive search. Here's my code:

package Big_Data.Spark_App 

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

object WordCount {
  def main(args: Array[String]): Unit = {
    val conf=new SparkConf().setAppName("WordCountApp").setMaster("local")
    val sc = new SparkContext(conf)    
    val text = sc.textFile("/home/cloudera/Downloads/wordCount.txt")
    val text1 = text.flatMap(rec=>rec.split(" ")).map(rec=>(rec,1))
    val text2 = text1.reduceByKey( (v1,v2)=>v1+v2).cache

    val all = text2.collect()  //line 14
    all.foreach(println)           
  }
}

Here is the debug Variables view, which shows no actual data in the text2 variable.

Recommended answer

Spark does not evaluate each variable the way you expect. Instead, it builds a DAG that is executed only once an action (e.g. collect) is called; the post "How DAG works under the covers in RDD?" explains this in more detail. Essentially, those intermediate variables only store a reference to the chain of operations you created, not the data itself. If you'd like to inspect intermediate results, you need to call collect on each variable.
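As a rough illustration of the point above, here is a minimal sketch of materializing the intermediate RDDs so that the debugger has local data to show. The object name InspectIntermediate, the helper buildCounts, and the in-memory sample line (used in place of wordCount.txt) are assumptions made for this example and are not part of the original question or answer.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object InspectIntermediate {
  // Hypothetical helper: builds the same lineage as text1/text2 in the question.
  def buildCounts(text: RDD[String]): (RDD[(String, Int)], RDD[(String, Int)]) = {
    val text1 = text.flatMap(rec => rec.split(" ")).map(rec => (rec, 1))
    val text2 = text1.reduceByKey((v1, v2) => v1 + v2)
    (text1, text2)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("InspectIntermediate").setMaster("local")
    val sc = new SparkContext(conf)
    try {
      // Assumption: in-memory sample data stands in for the original text file.
      val text = sc.parallelize(Seq("aa aa bb cc cc"))
      val (text1, text2) = buildCounts(text)

      // The lineage string describes the recorded operations, not the data.
      println(text2.toDebugString)

      // Each collect() is an action: it runs the lineage and returns a local
      // Array, which the Variables view can display like any other array.
      val text1Data = text1.collect()
      val text2Data = text2.collect()

      text1Data.foreach(println)
      text2Data.foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

With a breakpoint placed after the two collect() calls, text1Data and text2Data should show their elements in the debugger, while text1 and text2 still show only RDD metadata. On a large RDD, collect() pulls every element to the driver, so take(n) is a lighter way to peek at a few elements.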

Edit:

I forgot to mention above that you also have the option of inspecting variables inside a Spark operation. Say you break down a mapper like this:

val conf=new SparkConf().setAppName("WordCountApp").setMaster("local")
val sc = new SparkContext(conf)
val text = sc.textFile("wordcount.txt")
val text1 = text.flatMap{ rec =>
  val splitStr = rec.split(" ") //can inspect this variable
  splitStr.map(r => (r, 1)) //can inspect variable r
}
val text2 = text1.reduceByKey( (v1,v2)=>v1+v2).cache
val all = text2.collect() 
all.foreach(println)

You can put a breakpoint inside the mapper, for example to inspect splitStr for each line of text, or on the next line to inspect r for each word.
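Because transformations are lazy, a breakpoint inside the mapper should only be hit once an action such as collect runs, not when the flatMap line itself is stepped over. The sketch below illustrates this with a counter. The object name LazyMapperDemo, the helper run, the accumulator name, and the in-memory sample line are hypothetical stand-ins and not part of the original answer.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LazyMapperDemo {
  // Returns how many lines the mapper had processed before and after the action.
  def run(sc: SparkContext): (Long, Long) = {
    val linesSeen = sc.longAccumulator("linesSeen")
    // Assumption: in-memory sample data replaces the original text file.
    val text = sc.parallelize(Seq("aa aa bb cc cc"), 1)
    val text1 = text.flatMap { rec =>
      linesSeen.add(1)               // runs only when a task executes the mapper
      val splitStr = rec.split(" ")  // a breakpoint here is hit per line, during the action
      splitStr.map(r => (r, 1))
    }
    val before: Long = linesSeen.value // no action yet, so the mapper has not run
    text1.collect()                    // action: the mapper body executes now
    val after: Long = linesSeen.value
    (before, after)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("LazyMapperDemo").setMaster("local")
    val sc = new SparkContext(conf)
    try {
      val (before, after) = run(sc)
      println(s"lines processed before collect: $before, after collect: $after")
    } finally {
      sc.stop()
    }
  }
}
```

In a local debug session the mapper runs in the same JVM as the driver, which is why a breakpoint inside the closure can be reached at all; the counter only moves once collect() triggers the job.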
