SparkSQL job fails when calling stddev over 1,000 columns


Problem description

I am on Databricks with Spark 2.2.1 and Scala 2.11. I am attempting to run a SQL query that looks like the following.

select stddev(col1), stddev(col2), ..., stddev(col1300)
from mydb.mytable

I then execute the code as follows.

myRdd = sqlContext.sql(sql)

However, I see the following exception thrown.


Job aborted due to stage failure: Task 24 in stage 16.0 failed 4 times, most recent failure: Lost task 24.3 in stage 16.0 (TID 1946, 10.184.163.105, executor 3): org.codehaus.janino.JaninoRuntimeException: failed to compile: org.codehaus.janino.JaninoRuntimeException: Constant pool for class org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection has grown past JVM limit of 0xFFFF
/* 001 */ public java.lang.Object generate(Object[] references) {
/* 002 */   return new SpecificMutableProjection(references);
/* 003 */ }
/* 004 */
/* 005 */ class SpecificMutableProjection extends org.apache.spark.sql.catalyst.expressions.codegen.BaseMutableProjection {
/* 006 */
/* 007 */   private Object[] references;
/* 008 */   private InternalRow mutableRow;
/* 009 */   private boolean evalExprIsNull;
/* 010 */   private boolean evalExprValue;
/* 011 */   private boolean evalExpr1IsNull;
/* 012 */   private boolean evalExpr1Value;
/* 013 */   private boolean evalExpr2IsNull;
/* 014 */   private boolean evalExpr2Value;
/* 015 */   private boolean evalExpr3IsNull;
/* 016 */   private boolean evalExpr3Value;
/* 017 */   private boolean evalExpr4IsNull;
/* 018 */   private boolean evalExpr4Value;
/* 019 */   private boolean evalExpr5IsNull;
/* 020 */   private boolean evalExpr5Value;
/* 021 */   private boolean evalExpr6IsNull;

The stacktrace just goes on and on, and even the Databricks notebook crashes because of its verbosity. Has anyone ever seen this?

Also, I have the following two SQL statements to get the average and median, and they execute without any problems.

select avg(col1), ..., avg(col1300) from mydb.mytable
select percentile_approx(col1, 0.5), ..., percentile_approx(col1300, 0.5) from mydb.mytable

The problem seems to be with stddev, but the exception is not helpful. Any ideas on what's going on? Is there another way to compute the standard deviation that won't run into this problem?

It turns out this post describes the same problem, saying that Spark cannot handle wide schemas with many columns because of the 64KB limit on generated classes. However, if that's the case, why do avg and percentile_approx work?

Solution

Some options:


  • Try disabling whole-stage code generation (a short usage sketch follows the config line below):

spark.conf.set("spark.sql.codegen.wholeStage", false)
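A minimal usage sketch, assuming the sql string from the question and the SparkSession exposed as spark on Databricks (sqlContext.sql would work just as well):

// Turn off whole-stage code generation for this session, then re-run the query.
spark.conf.set("spark.sql.codegen.wholeStage", false)
val result = spark.sql(sql)   // the same 1,300-column stddev statement
result.show()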


  • If the above does not help, switch to RDDs (from this answer, by zero323); a follow-up for reading the standard deviations out of the summarizer is sketched after the snippet:

    import org.apache.spark.mllib.linalg._
    import org.apache.spark.mllib.stat.MultivariateOnlineSummarizer
    import org.apache.spark.sql.functions.col

    val columns: Seq[String] = ???  // the 1,300 column names

    df
      .select(columns.map(c => col(c).cast("double")): _*)   // cast everything to double
      .rdd
      .map(row => Vectors.dense(columns.map(row.getAs[Double](_)).toArray))
      .aggregate(new MultivariateOnlineSummarizer)(           // single streaming pass, no codegen
         (agg, v) => agg.add(v),
         (agg1, agg2) => agg1.merge(agg2))
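    The aggregate above returns a MultivariateOnlineSummarizer, which exposes per-column variances rather than standard deviations. A short follow-up sketch (the value name summary is an assumption, not part of the answer) for reading the standard deviations off it:

    // Assuming the .aggregate(...) expression above is bound to a value, e.g.
    //   val summary = df.select(...).rdd.map(...).aggregate(new MultivariateOnlineSummarizer)(...)
    // variance is an mllib Vector of per-column sample variances; the standard
    // deviations are just the element-wise square roots.
    val stddevs: Array[Double] = summary.variance.toArray.map(math.sqrt)
    val stddevByColumn: Map[String, Double] = columns.zip(stddevs).toMap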
    


  • Assemble the columns into a single vector using VectorAssembler and use an Aggregator similar to the one used here (https://stackoverflow.com/a/41927881/9613318), adjusting the finish method (you might need some additional tweaking to convert ml.linalg.Vectors to mllib.linalg.Vectors). A rough sketch of the assembler part follows.
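
    Not the Aggregator from the linked answer itself, but a rough sketch of the assembler route under the same assumptions (df and columns as above, no nulls in the input): VectorAssembler packs the columns into one ml.linalg.Vector column, Vectors.fromML handles the ml-to-mllib conversion mentioned above, and the summarizer does the rest.

    import org.apache.spark.ml.feature.VectorAssembler
    import org.apache.spark.ml.linalg.{Vector => MLVector}
    import org.apache.spark.mllib.linalg.{Vectors => OldVectors}
    import org.apache.spark.mllib.stat.MultivariateOnlineSummarizer
    import org.apache.spark.sql.functions.col

    // Pack all numeric columns into a single vector column named "features".
    val assembled = new VectorAssembler()
      .setInputCols(columns.toArray)
      .setOutputCol("features")
      .transform(df.select(columns.map(c => col(c).cast("double").alias(c)): _*))

    // Convert each ml.linalg.Vector to its mllib counterpart and feed the summarizer.
    val summary = assembled
      .select("features")
      .rdd
      .map(row => OldVectors.fromML(row.getAs[MLVector]("features")))
      .aggregate(new MultivariateOnlineSummarizer)(
        (agg, v) => agg.add(v),
        (agg1, agg2) => agg1.merge(agg2))

    val stddevs = summary.variance.toArray.map(math.sqrt)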


    However, if that's the case, then why do avg and percentile_approx work?

    Spark literally generates Java code for these stages. Because the logic is not the same, the size of the generated output differs; a quick way to inspect that generated code is shown below.
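    If you want to see the difference for yourself, Spark can print the code it generates for a plan. An illustrative sketch, assuming the spark session and the question's table (shown with only two columns so the output stays readable):

    import org.apache.spark.sql.execution.debug._

    // Prints the Java source Spark generates for each whole-stage-codegen subtree.
    spark.sql("select stddev(col1), stddev(col2) from mydb.mytable").debugCodegen()
    spark.sql("select avg(col1), avg(col2) from mydb.mytable").debugCodegen()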

