How to add new columns based on conditions (without facing JaninoRuntimeException or OutOfMemoryError)?
Question
Trying to create a Spark data frame with multiple additional columns based on conditions like this:
df
.withColumn("name1", someCondition1)
.withColumn("name2", someCondition2)
.withColumn("name3", someCondition3)
.withColumn("name4", someCondition4)
.withColumn("name5", someCondition5)
.withColumn("name6", someCondition6)
.withColumn("name7", someCondition7)
I am faced with the following exception when more than 6 .withColumn clauses are added:
org.codehaus.janino.JaninoRuntimeException: Code of method "()V" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator" grows beyond 64 KB
This problem has been reported elsewhere as well, e.g.:
- Spark ML Pipeline causes java.lang.Exception: failed to compile ... Code ... grows beyond 64 KB
- https://github.com/rstudio/sparklyr/issues/264
Is there a property in Spark where I can configure the size?
If even more columns are created, e.g. around 20, I no longer receive the aforementioned exception, but instead get the following error after 5 minutes of waiting:
java.lang.OutOfMemoryError: GC overhead limit exceeded
What I want to perform is a spelling/error correction. Some simple cases can be handled easily via a map & replace in a UDF. Still, several other cases with multiple chained conditions remain.
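The map & replace idea mentioned above can be sketched as a plain lookup function (a minimal sketch; the `corrections` table and the example entries are my own hypothetical placeholders, not from the question):

```scala
// A hypothetical correction table: known misspellings mapped to their fixes.
val corrections: Map[String, String] = Map(
  "teh"     -> "the",
  "recieve" -> "receive"
)

// Return the corrected spelling if one is known, otherwise pass the word through.
def correct(word: String): String = corrections.getOrElse(word, word)
```

In Spark such a function could then be wrapped in a single UDF and applied once per column, rather than encoding each correction as its own chained condition.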
I will also follow up on: https://issues.apache.org/jira/browse/SPARK-18532
A minimal reproducible example can be found here: https://gist.github.com/geoHeil/86e5401fc57351c70fd49047c88cea05
Answer
This error is caused by WholeStageCodegen and a JVM limitation.
Quick answer: no, you cannot change the limit. Please look at this question: 64 KB is the maximum size of a method in the JVM.
We must wait for a workaround in Spark; currently there is nothing you can change in the system parameters.
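While the 64 KB limit itself cannot be raised, a common mitigation (my own suggestion, not part of the answer above) is to avoid long `.withColumn` chains and instead build all the new columns once and apply them in a single `select`, which tends to keep each generated method smaller. A sketch, reusing the `someCondition*` expressions from the question:

```scala
import org.apache.spark.sql.functions.col

// Collect all derived columns once, each aliased to its target name.
// someCondition1..someCondition7 are the Column expressions from the question.
val newCols = Seq(
  someCondition1.as("name1"),
  someCondition2.as("name2"),
  someCondition3.as("name3"),
  someCondition4.as("name4"),
  someCondition5.as("name5"),
  someCondition6.as("name6"),
  someCondition7.as("name7")
)

// One select instead of seven chained .withColumn calls.
val result = df.select(df.columns.map(col) ++ newCols: _*)
```

If that is not enough, materializing an intermediate result (e.g. `df.checkpoint()` after setting a checkpoint directory) breaks the lineage so that codegen compiles smaller units; as a last resort, whole-stage codegen can be disabled entirely with `spark.conf.set("spark.sql.codegen.wholeStage", false)`, at some performance cost.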