Spark:转换为DF后collect(),take()和show()输出之间的差异 [英] Spark: Difference between collect(), take() and show() outputs after conversion toDF

查看:574
本文介绍了Spark:转换为DF后collect(),take()和show()输出之间的差异的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在使用Spark 1.5.

I am using Spark 1.5.

我有一列30个ID,将从数据库中以integers的形式加载:

I have a column of 30 ids which I am loading as integers from a database:

val numsRDD = sqlContext
     .table(constants.SOURCE_DB + "." + IDS)
     .select("id")
     .distinct
     .map(row=>row.getInt(0))

这是numsRDD的输出:

numsRDD.collect.foreach(println(_))

643761
30673603
30736590
30773400
30832624
31104189
31598495
31723487
32776244
32801792
32879386
32981901
33469224
34213505
34709608
37136455
37260344
37471301
37573190
37578690
37582274
37600896
37608984
37616677
37618105
37644500
37647770
37648497
37720353
37741608

接下来,我想为所有ids生成3的所有组合,然后将每个组合保存为以下形式的元组:< tripletID: String, triplet: Array(Int)>并将其转换为 dataframe ,具体操作如下:

Right next, I want to produce all combinations of 3 for those ids then save each combination as a tuple of the form: < tripletID: String, triplet: Array(Int)> and convert it into a dataframe, which I do as follows:

// |combinationsDF| = 4060 combinations
val combinationsDF = sc
  .parallelize(numsRDD
     .collect
     .combinations(3)
     .toArray
     .map(row => row.sorted)
     .map(row => (
        List(row(0), row(1), row(2)).mkString(","), 
        List(row(0), row(1), row(2)).toArray)))
  .toDF("tripletID","triplet")

一旦我这样做,我就会尝试打印combinationsDF的某些内容,只是为了确保所有内容都应该是应该的样子.所以我试试这个:

As soon as I do that I try to print some of the combinationsDF's contents just to make sure that everything is the way it should be. So I try this:

combinationsDF.show

返回:

+--------------------+--------------------+
|           tripletID|             triplet|
+--------------------+--------------------+
|,37136455,3758227...|[32776244, 371364...|
|,37136455,3761667...|[32776244, 371364...|
|,32776244,3713645...|[31723487, 327762...|
|,37136455,3757869...|[32776244, 371364...|
|,32776244,3713645...|[31598495, 327762...|
|,37136455,3760089...|[32776244, 371364...|
|,37136455,3764849...|[32776244, 371364...|
|,37136455,3764450...|[32776244, 371364...|
|,37136455,3747130...|[32776244, 371364...|
|,32981901,3713645...|[32776244, 329819...|
|,37136455,3761810...|[32776244, 371364...|
|,34213505,3713645...|[32776244, 342135...|
|,37136455,3726034...|[32776244, 371364...|
|,37136455,3772035...|[32776244, 371364...|
|2776244,37136455...|[643761, 32776244...|
|,37136455,3764777...|[32776244, 371364...|
|,37136455,3760898...|[32776244, 371364...|
|,32879386,3713645...|[32776244, 328793...|
|,32776244,3713645...|[31104189, 327762...|
|,32776244,3713645...|[30736590, 327762...|
+--------------------+--------------------+
only showing top 20 rows

很明显,每个tripletID第一个元素都丢失了.因此,请确保100%确保按如下方式使用take(20):

As it is evident, the first element of every tripletID is missing. So, just to be 100% sure I use take(20) as follows:

combinationsDF.take(20).foreach(println(_))

返回如下所示的更详细的表示形式:

which returns a more detailed representation as per below:

[,37136455,37582274,WrappedArray(32776244, 37136455, 37582274)]
[,37136455,37616677,WrappedArray(32776244, 37136455, 37616677)]
[,32776244,37136455,WrappedArray(31723487, 32776244, 37136455)]
[,37136455,37578690,WrappedArray(32776244, 37136455, 37578690)]
[,32776244,37136455,WrappedArray(31598495, 32776244, 37136455)]
[,37136455,37600896,WrappedArray(32776244, 37136455, 37600896)]
[,37136455,37648497,WrappedArray(32776244, 37136455, 37648497)]
[,37136455,37644500,WrappedArray(32776244, 37136455, 37644500)]
[,37136455,37471301,WrappedArray(32776244, 37136455, 37471301)]
[,32981901,37136455,WrappedArray(32776244, 32981901, 37136455)]
[,37136455,37618105,WrappedArray(32776244, 37136455, 37618105)]
[,34213505,37136455,WrappedArray(32776244, 34213505, 37136455)]
[,37136455,37260344,WrappedArray(32776244, 37136455, 37260344)]
[,37136455,37720353,WrappedArray(32776244, 37136455, 37720353)]
[2776244,37136455,WrappedArray(643761, 32776244, 37136455)]
[,37136455,37647770,WrappedArray(32776244, 37136455, 37647770)]
[,37136455,37608984,WrappedArray(32776244, 37136455, 37608984)]
[,32879386,37136455,WrappedArray(32776244, 32879386, 37136455)]
[,32776244,37136455,WrappedArray(31104189, 32776244, 37136455)]
[,32776244,37136455,WrappedArray(30736590, 32776244, 37136455)]

因此,现在我可以确定无论出于何种原因,tripletID中的第一个ID都将以某种方式出现.但是,如果我尝试使用collect而不是take(20):

So now I am sure that the first id from tripletID is somehow for whatever reason deprecated. But still, if I try to use collect instead of take(20):

combinationsDF.collect.foreach(println(_))

一切都恢复了正常(!!!):

everything goes back to being fine again (!!!):

[32776244,37136455,37582274,WrappedArray(32776244, 37136455, 37582274)]
[32776244,37136455,37616677,WrappedArray(32776244, 37136455, 37616677)]
[31723487,32776244,37136455,WrappedArray(31723487, 32776244, 37136455)]
[32776244,37136455,37578690,WrappedArray(32776244, 37136455, 37578690)]
[31598495,32776244,37136455,WrappedArray(31598495, 32776244, 37136455)]
[32776244,37136455,37600896,WrappedArray(32776244, 37136455, 37600896)]
[32776244,37136455,37648497,WrappedArray(32776244, 37136455, 37648497)]
[32776244,37136455,37644500,WrappedArray(32776244, 37136455, 37644500)]
[32776244,37136455,37471301,WrappedArray(32776244, 37136455, 37471301)]
[32776244,32981901,37136455,WrappedArray(32776244, 32981901, 37136455)]
[32776244,37136455,37618105,WrappedArray(32776244, 37136455, 37618105)]
[32776244,34213505,37136455,WrappedArray(32776244, 34213505, 37136455)]
[32776244,37136455,37260344,WrappedArray(32776244, 37136455, 37260344)]
[32776244,37136455,37720353,WrappedArray(32776244, 37136455, 37720353)]
[643761,32776244,37136455,WrappedArray(643761, 32776244, 37136455)]
[32776244,37136455,37647770,WrappedArray(32776244, 37136455, 37647770)]
[32776244,37136455,37608984,WrappedArray(32776244, 37136455, 37608984)]
[32776244,32879386,37136455,WrappedArray(32776244, 32879386, 37136455)]
[31104189,32776244,37136455,WrappedArray(31104189, 32776244, 37136455)]
[30736590,32776244,37136455,WrappedArray(30736590, 32776244, 37136455)]
...

1.在将组合数组parallelize放入RDD中之前,我已经详尽地查询了步骤,一切正常. 2.我还已经在应用parallelize后再次打印输出,然后再次 一切正常. 3.该问题似乎与 numsRDD转换为DF 有关,尽管我已尽力而为,但我仍无法解决. 4.我也无法使用相同的代码片段重现模拟数据的问题.

1. I have exhaustively queried the steps just before I parallelize the array of combinations into an RDD and everything is ok. 2. I have also printed the output right after parallelize is applied and again everything is ok. 3. The problem appears to be related with the conversion of the numsRDD to a DF and despite my best efforts I cannot deal with it. 4. I was also incapable of reproducing the problem with mock data using the same code snippet.

那么首先:是什么引起了这个问题? 第二个:我该如何解决?

So first: What's causing this problem? and second: How do I fix it?

推荐答案

我将检查原始的numsRDD,看来您其中可能有空字符串或空值.这对我有用:

I would check your original numsRDD, it looks like you might have an empty string or null value in there. This works for me:

scala> val numsRDD = sc.parallelize(0 to 30)
numsRDD: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:27

scala> :pa
// Entering paste mode (ctrl-D to finish)

val combinationsDF = sc
  .parallelize(numsRDD
     .collect
     .combinations(3)
     .toArray
     .map(row => row.sorted)
     .map(row => (
        List(row(0), row(1), row(2)).mkString(","),
        List(row(0), row(1), row(2)).toArray)))
  .toDF("tripletID","triplet")

// Exiting paste mode, now interpreting.

combinationsDF: org.apache.spark.sql.DataFrame = [tripletID: string, triplet: array<int>]

scala> combinationsDF.show
+---------+----------+
|tripletID|   triplet|
+---------+----------+
|    0,1,2| [0, 1, 2]|
|    0,1,3| [0, 1, 3]|
|    0,1,4| [0, 1, 4]|
|    0,1,5| [0, 1, 5]|
|    0,1,6| [0, 1, 6]|
|    0,1,7| [0, 1, 7]|
|    0,1,8| [0, 1, 8]|
|    0,1,9| [0, 1, 9]|
|   0,1,10|[0, 1, 10]|
|   0,1,11|[0, 1, 11]|
|   0,1,12|[0, 1, 12]|
|   0,1,13|[0, 1, 13]|
|   0,1,14|[0, 1, 14]|
|   0,1,15|[0, 1, 15]|
|   0,1,16|[0, 1, 16]|
|   0,1,17|[0, 1, 17]|
|   0,1,18|[0, 1, 18]|
|   0,1,19|[0, 1, 19]|
|   0,1,20|[0, 1, 20]|
|   0,1,21|[0, 1, 21]|
+---------+----------+
only showing top 20 rows

我能想到的唯一的另一件事是mkString无法正常工作.试用此字符串插值法(也无需重新创建List):

The only other thing I can think of is mkString not working like you would expect. Try out this string interpolation (also no need to recreate the List):

val combinationsDF = sc
  .parallelize(numsRDD
     .collect
     .combinations(3)
     .toArray
     .map(row => row.sorted)
     .map{case List(a,b,c) => (
        s"$a,$b,$c", 
        Array(a,b,c))}
  .toDF("tripletID","triplet")

scala> combinationsDF.show
+---------+----------+
|tripletID|   triplet|
+---------+----------+
|    0,1,2| [0, 1, 2]|
|    0,1,3| [0, 1, 3]|
|    0,1,4| [0, 1, 4]|
|    0,1,5| [0, 1, 5]|
|    0,1,6| [0, 1, 6]|
|    0,1,7| [0, 1, 7]|
|    0,1,8| [0, 1, 8]|
|    0,1,9| [0, 1, 9]|
|   0,1,10|[0, 1, 10]|
|   0,1,11|[0, 1, 11]|
|   0,1,12|[0, 1, 12]|
|   0,1,13|[0, 1, 13]|
|   0,1,14|[0, 1, 14]|
|   0,1,15|[0, 1, 15]|
|   0,1,16|[0, 1, 16]|
|   0,1,17|[0, 1, 17]|
|   0,1,18|[0, 1, 18]|
|   0,1,19|[0, 1, 19]|
|   0,1,20|[0, 1, 20]|
|   0,1,21|[0, 1, 21]|
+---------+----------+
only showing top 20 rows

这篇关于Spark:转换为DF后collect(),take()和show()输出之间的差异的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆