Getting OutOfMemoryError: GC overhead limit exceeded in PySpark
Problem description
In the middle of my project, I am getting the error below after invoking a function in my Spark SQL query.
I have written a user-defined function that takes two strings, concatenates them, and returns the right-most five characters of the result, depending on the total string length (an alternative to SQL Server's right(string, integer)).
from pyspark.sql.types import StringType

def concatstring(xstring, ystring):
    # Concatenate, then keep the last five characters for lengths 6 and 7
    newvalstring = xstring + ystring
    print(newvalstring)
    if len(newvalstring) == 6:
        return newvalstring[1:6]
    if len(newvalstring) == 7:
        return newvalstring[2:7]
    # Any other length falls back to the default sort value
    return '99999'

# `spark` is the SparkSession provided by the pyspark shell
spark.udf.register('rightconcat', concatstring, StringType())
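The UDF above only handles concatenated lengths of 6 and 7 and returns '99999' for everything else. If the intent is simply "the last five characters", as SQL Server's right(string, 5) would return, negative slicing covers every length. This is a minimal sketch under that assumption; the name right_concat, the default n=5, and the choice to return None for null inputs are illustrative and not part of the original code, and the results differ from the original for lengths other than 6 and 7.

```python
def right_concat(x, y, n=5):
    """Concatenate x and y and return the right-most n characters."""
    # Assumption: a null input yields a null result, as in SQL.
    if x is None or y is None:
        return None
    combined = x + y
    # A negative slice keeps the last n characters for any length;
    # strings shorter than n are returned unchanged.
    return combined[-n:]


print(right_concat("00000", "12"))
```

It could be registered the same way as the original function, for example with spark.udf.register('rightconcat', right_concat, StringType()) in a session where spark is defined.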
It works fine on its own. But when I use it as a column in my Spark SQL query, this exception occurs. The query is:
spark.sql("""
    select d.BldgID, d.LeaseID, d.SuiteID,
        coalesce(BLDG.BLDGNAME, ('select EmptyDefault from EmptyDefault')) as LeaseBldgName,
        coalesce(l.OCCPNAME, ('select EmptyDefault from EmptyDefault')) as LeaseOccupantName,
        coalesce(l.DBA, ('select EmptyDefault from EmptyDefault')) as LeaseDBA,
        coalesce(l.CONTNAME, ('select EmptyDefault from EmptyDefault')) as LeaseContact,
        coalesce(l.PHONENO1, '') as LeasePhone1,
        coalesce(l.PHONENO2, '') as LeasePhone2,
        coalesce(l.NAME, '') as LeaseName,
        coalesce(l.ADDRESS, '') as LeaseAddress1,
        coalesce(l.ADDRESS2, '') as LeaseAddress2,
        coalesce(l.CITY, '') as LeaseCity,
        coalesce(l.STATE, ('select EmptyDefault from EmptyDefault')) as LeaseState,
        coalesce(l.ZIPCODE, '') as LeaseZip,
        coalesce(l.ATTENT, '') as LeaseAttention,
        coalesce(l.TTYPID, ('select EmptyDefault from EmptyDefault')) as LeaseTenantType,
        coalesce(TTYP.TTYPNAME, ('select EmptyDefault from EmptyDefault')) as LeaseTenantTypeName,
        l.OCCPSTAT as LeaseCurrentOccupancyStatus,
        l.EXECDATE as LeaseExecDate,
        l.RENTSTRT as LeaseRentStartDate,
        l.OCCUPNCY as LeaseOccupancyDate,
        l.BEGINDATE as LeaseBeginDate,
        l.EXPIR as LeaseExpiryDate,
        l.VACATE as LeaseVacateDate,
        coalesce(l.STORECAT, (select EmptyDefault from EmptyDefault)) as LeaseStoreCategory,
        rightconcat('00000', cast(coalesce(SCAT.SORTSEQ, 99999) as string)) as LeaseStoreCategorySortID
    from Dim_CMLease_primer d
    join LEAS l on l.BLDGID = d.BldgID and l.LEASID = d.LeaseID
    left outer join SUIT on SUIT.BLDGID = l.BLDGID and SUIT.SUITID = l.SUITID
    left outer join BLDG on BLDG.BLDGID = l.BLDGID
    left outer join SCAT on SCAT.STORCAT = l.STORECAT
    left outer join TTYP on TTYP.TTYPID = l.TTYPID
""").show()
I have uploaded the query and the state after running it here. How can I solve this problem? Kindly guide me.
Recommended answer
The simplest thing to try would be increasing the Spark executor memory:
spark.executor.memory=6g
Make sure you're using all the available memory. You can check that in the Spark UI.
Update 1
--conf spark.executor.extraJavaOptions="Option"
You can pass -Xmx1024m as an option.
What are your current spark.driver.memory and spark.executor.memory values?
Increasing them should resolve the problem.
Bear in mind that, according to the Spark documentation:
Note that it is illegal to set Spark properties or heap size settings with this option. Spark properties should be set using a SparkConf object or the spark-defaults.conf file used with the spark-submit script. Heap size settings can be set with spark.executor.memory.
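The memory settings discussed above could be passed on the spark-submit command line. The sketch below only assembles the argument list and does not launch anything. The helper name build_spark_submit_args, the script name job.py, and the 6g and 4g values are illustrative assumptions, not recommendations. In line with the documentation note, it rejects -Xmx inside spark.executor.extraJavaOptions so that the heap size is set through spark.executor.memory instead.

```python
def build_spark_submit_args(script, conf):
    """Return a spark-submit argument list for the given Spark properties."""
    extra = conf.get("spark.executor.extraJavaOptions", "")
    # Heap size belongs in spark.executor.memory, not in extraJavaOptions.
    if any(token.startswith("-Xmx") for token in extra.split()):
        raise ValueError("set the heap size with spark.executor.memory, not -Xmx")
    args = ["spark-submit"]
    for key in sorted(conf):  # sorted only to make the output deterministic
        args += ["--conf", "{}={}".format(key, conf[key])]
    args.append(script)
    return args


# Hypothetical values; tune them to the cluster's actual capacity.
args = build_spark_submit_args(
    "job.py",
    {"spark.executor.memory": "6g", "spark.driver.memory": "4g"},
)
print(" ".join(args))
```

The resulting list can be handed to subprocess.run, or the printed line can be pasted into a shell.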
Update 2
Since the GC overhead error is a garbage collection problem, I would also recommend reading this great answer.