Spark ML Pipeline Causes java.lang.Exception: failed to compile ... Code ... grows beyond 64 KB


Problem Description

Using Spark 2.0, I am trying to run a simple VectorAssembler in a pyspark ML pipeline, like so:

from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler

feature_assembler = VectorAssembler(inputCols=['category_count', 'name_count'],
                                    outputCol="features")
pipeline = Pipeline(stages=[feature_assembler])
model = pipeline.fit(df_train)
model_output = model.transform(df_train)

When I try to look at the output using

model_output.select("features").show(1)

I get the error

Py4JJavaError                             Traceback (most recent call last)
<ipython-input-95-7a3e3d4f281c> in <module>()
      2 
      3 
----> 4 model_output.select("features").show(1)

/usr/local/spark20/python/pyspark/sql/dataframe.pyc in show(self, n, truncate)
    285         +---+-----+
    286         """
--> 287         print(self._jdf.showString(n, truncate))
    288 
    289     def __repr__(self):

/usr/local/spark20/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py in __call__(self, *args)
    931         answer = self.gateway_client.send_command(command)
    932         return_value = get_return_value(
--> 933             answer, self.gateway_client, self.target_id, self.name)
    934 
    935         for temp_arg in temp_args:

/usr/local/spark20/python/pyspark/sql/utils.pyc in deco(*a, **kw)
     61     def deco(*a, **kw):
     62         try:
---> 63             return f(*a, **kw)
     64         except py4j.protocol.Py4JJavaError as e:
     65             s = e.java_exception.toString()

/usr/local/spark20/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    310                 raise Py4JJavaError(
    311                     "An error occurred while calling {0}{1}{2}.\n".
--> 312                     format(target_id, ".", name), value)
    313             else:
    314                 raise Py4JError(

Py4JJavaError: An error occurred while calling o2875.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure:   
Task 0 in stage 1084.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1084.0 (TID 42910, 169.45.92.174): 
java.util.concurrent.ExecutionException: java.lang.Exception: failed to 
compile: org.codehaus.janino.JaninoRuntimeException: Code of method "

  (Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I" of class    
"org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
grows beyond 64 KB

It then lists the generated code, which looks like:

/* 001 */ public SpecificOrdering generate(Object[] references) {
/* 002 */   return new SpecificOrdering(references);
/* 003 */ }
/* 004 */
/* 005 */ class SpecificOrdering extends org.apache.spark.sql.catalyst.expressions.codegen.BaseOrdering {
/* 006 */
/* 007 */   private Object[] references;
/* 008 */
/* 009 */
/* 010 */   public int compareStruct1(InternalRow a, InternalRow b) {
/* 011 */     InternalRow i = null;
/* 012 */
/* 013 */     i = a;
/* 014 */     boolean isNullA836;
/* 015 */     byte primitiveA834;
/* 016 */     {
/* 017 */
/* 018 */       boolean isNull834 = i.isNullAt(0);
/* 019 */       byte value834 = isNull834 ? (byte)-1 : (i.getByte(0));
/* 020 */       isNullA836 = isNull834;
/* 021 */       primitiveA834 = value834;
/* 022 */     }
/* 023 */     i = b;

…

/* 28649 */     return 0;
/* 28650 */   }
/* 28651 */ }

Why would this simple VectorAssembler generate 28,651 lines of code and exceed 64KB?

Recommended Answer

The 64 KB figure is a JVM constraint: the bytecode of a single Java method cannot exceed 64 KB. Because Spark evaluates lazily, it accumulates the whole chain of transformations and then generates code for the resulting plan in one piece; when that plan gets long enough, the generated method blows past the limit. In other words, Spark is being a little too lazy in this case, which is what causes it to hit that limit.
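If you want to see how large the lazily accumulated plan has become before triggering execution, you can print it. This is a minimal sketch reusing model_output from the question; the exact plan output will depend on your data:

# Print the parsed, analyzed, optimized and physical plans that Spark has
# built up lazily for this DataFrame; a very long plan here is a hint that
# code generation may produce oversized methods.
model_output.select("features").explain(True)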

I was able to work around this same exception, which I was triggering via a join rather than with a VectorAssembler, by calling cache on one of my Datasets about halfway through my pipeline. I don't know (yet) exactly why this solved the issue, however.
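A minimal sketch of that workaround, assuming a pipeline of DataFrame transformations; the df_lookup join and the df_intermediate name are hypothetical and not from the question. The idea is that caching an intermediate result materializes it, so downstream stages can be planned and code-generated against the cached data rather than the full accumulated lineage:

# Hypothetical pipeline step: a join roughly halfway through the pipeline.
df_intermediate = df_train.join(df_lookup, on="id", how="left")

# Cache and force evaluation so the intermediate result is materialized.
df_intermediate = df_intermediate.cache()
df_intermediate.count()

# Continue the rest of the pipeline from the cached DataFrame.
model_output = model.transform(df_intermediate)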
