How to add new columns based on conditions (without facing JaninoRuntimeException or OutOfMemoryError)?

Question

Trying to create a Spark DataFrame with multiple additional columns based on conditions like this:

df
    .withColumn("name1", someCondition1)
    .withColumn("name2", someCondition2)
    .withColumn("name3", someCondition3)
    .withColumn("name4", someCondition4)
    .withColumn("name5", someCondition5)
    .withColumn("name6", someCondition6)
    .withColumn("name7", someCondition7)

If more than 6 .withColumn clauses are added, I am faced with the following exception:

org.codehaus.janino.JaninoRuntimeException: Code of method "()V" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator" grows beyond 64 KB

This problem has been reported elsewhere as well.

Is there a property in Spark where I can configure this size?

If even more columns are created, e.g. around 20, I no longer receive the aforementioned exception, but instead get the following error after 5 minutes of waiting:

java.lang.OutOfMemoryError: GC overhead limit exceeded

What I want to perform is a spelling/error correction. Some simple cases can be handled easily via a map & replace in a UDF, as sketched below. Still, several other cases with multiple chained conditions remain.
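
For the simple cases, a minimal sketch of the map & replace idea, assuming Spark 2.x; the corrections map and the column name are hypothetical placeholders, not part of the original question:

import org.apache.spark.sql.functions.{col, udf}

// Hypothetical lookup table of known misspellings -> corrections.
val corrections = Map("teh" -> "the", "adress" -> "address")

// UDF: replace a value if it is a known misspelling, otherwise pass it through.
// withDefault(identity) makes the Map total, so unknown values map to themselves.
val correctSpelling = udf { (s: String) =>
  Option(s).map(corrections.withDefault(identity)).orNull
}

val corrected = df.withColumn("name1", correctSpelling(col("name1")))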

I will also follow up on: https://issues.apache.org/jira/browse/SPARK-18532

A minimal reproducible example can be found here https://gist.github.com/geoHeil/86e5401fc57351c70fd49047c88cea05

Answer

This error is caused by WholeStageCodegen and a JVM limitation.

Quick answer: no, you cannot change the limit. Please look at this question; 64KB is the maximum method size in the JVM.

We must wait for a proper fix in Spark; currently there is no system parameter that raises the limit itself.
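
That said, two mitigations are commonly suggested for this codegen error; a sketch follows, assuming Spark 2.x with a SparkSession available as `spark`. Neither raises the 64KB JVM method limit:

// 1. Disable whole-stage code generation so Catalyst falls back to the
//    interpreted iterator path instead of compiling one huge method.
spark.conf.set("spark.sql.codegen.wholeStage", false)

// 2. Collapse the chained withColumn calls into a single select, which
//    yields one projection instead of a stack of them. This can shrink the
//    generated code, though it does not guarantee staying under the limit.
import org.apache.spark.sql.functions.col

val withAll = df.select(
  (col("*") +: Seq(
    someCondition1.as("name1"),
    someCondition2.as("name2")
    // ... remaining conditions
  )): _*
)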
