Spark Exception "Complex types not supported" while loading Parquet


Problem description

I am trying to load a Parquet file in Spark as a DataFrame -

val df = spark.read.parquet(path)

and I am getting -

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 5.0 failed 4 times, most recent failure: Lost task 0.3 in stage 5.0 (TID 12, 10.250.2.32): java.lang.UnsupportedOperationException: Complex types not supported.

While going through the code, I realized there is a check in Spark's VectorizedParquetRecordReader.java (initializeInternal) -

Type t = requestedSchema.getFields().get(i);
if (!t.isPrimitive() || t.isRepetition(Type.Repetition.REPEATED)) {
  throw new UnsupportedOperationException("Complex types not supported.");
}

So I think it is failing on the isRepetition check. Can anybody suggest a way to solve the issue?

My Parquet data is like -

Key1 = value1
Key2 = value1
Key3 = value1
Key4:
.list:
..element:
...key5:
....list:
.....element:
......certificateSerialNumber = dfsdfdsf45345
......issuerName = CN=Microsoft Windows Verification PCA, O=Microsoft Corporation, L=Redmond, S=Washington, C=US
......subjectName = CN=Microsoft Windows, OU=MOPR, O=Microsoft Corporation, L=Redmond, S=Washington, C=US
......thumbprintAlgorithm = Sha1
......thumbprintContent = sfdasf42dsfsdfsdfsd
......validFrom = 2009-12-07 21:57:44.000000
......validTo = 2011-03-07 21:57:44.000000
....list:
.....element:
......certificateSerialNumber = dsafdsafsdf435345
......issuerName = CN=Microsoft Root Certificate Authority, DC=microsoft, DC=com
......subjectName = CN=Microsoft Windows Verification PCA, O=Microsoft Corporation, L=Redmond, S=Washington, C=US
......thumbprintAlgorithm = Sha1
......thumbprintContent = sdfsdfdsf43543
......validFrom = 2005-09-15 21:55:41.000000
......validTo = 2016-03-15 22:05:41.000000

And I suspect key4 may be raising the issue because of the nested tree. The input data is JSON, so maybe Parquet does not understand that kind of nesting.
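
A quick way to check which columns Spark actually treats as complex is to print the inferred schema; schema inference only reads the Parquet footers, so it typically works even when the full scan throws the exception above. A minimal sketch, reusing the same path as in the question -

// Columns reported as array<...> or struct<...> are the "complex types"
// that the vectorized reader rejects in this Spark version.
val df = spark.read.parquet(path)
df.printSchema()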

I found a bug report, https://issues.apache.org/jira/browse/HIVE-13744, but it describes a Hive complex type issue. I am not sure whether it will fix the issue with Parquet or not.

Update 1
Exploring the Parquet files further, I concluded the following -

I have 5 Parquet files created by spark.write. Two of those files are empty, so the schema for a column that was supposed to be ArrayType comes out as StringType, and when I try to read the directory as a whole I see the above exception.
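
One way to confirm that diagnosis is to read each part file on its own and compare the inferred schemas, so the mismatching (empty) files stand out. A minimal sketch; listing the directory through the Hadoop FileSystem API is an assumption about how the files are laid out -

import org.apache.hadoop.fs.{FileSystem, Path}

// Print the schema of every part file under the output directory.
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val partFiles = fs.listStatus(new Path(path))
  .map(_.getPath.toString)
  .filter(_.endsWith(".parquet"))

partFiles.foreach { file =>
  println(s"--- $file")
  spark.read.parquet(file).printSchema()
}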

Answer

Take 1
SPARK-12854 Vectorize Parquet reader indicates that "ColumnarBatch supports structs and arrays" (cf. GitHub pull request 10820), starting with Spark 2.0.0.

And SPARK-13518 Enable vectorized parquet reader by default, also starting with Spark 2.0.0, deals with the property spark.sql.parquet.enableVectorizedReader (cf. GitHub commit e809074).

My 2 cents: disable that "VectorizedReader" optimization and see what happens.
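
A minimal sketch of that experiment; the property name comes from SPARK-13518 above, and path is the same input path as in the question -

// Fall back to the non-vectorized (row-based) Parquet reader,
// which does not have the "Complex types not supported" restriction.
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")

val df = spark.read.parquet(path)
df.show(5)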

Take 2
Since the problem has been narrowed down to some empty files that do not carry the same schema as the "real" files, my 3 cents: experiment with spark.sql.parquet.mergeSchema to see whether the schema from the real files takes precedence after merging.
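
A sketch of that experiment on the read side; whether the merged schema actually resolves the StringType-vs-ArrayType conflict coming from the empty files is exactly what it should tell you -

// Merge the footers of all part files instead of trusting a single file.
// Equivalent global setting: spark.conf.set("spark.sql.parquet.mergeSchema", "true")
val merged = spark.read
  .option("mergeSchema", "true")
  .parquet(path)

merged.printSchema()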

Other than that, you might try to eliminate the empty files at write time with some kind of repartitioning, e.g. coalesce(1) (OK, 1 is a bit caricatural, but you see the point).
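
A sketch of the write-side fix, assuming you control the job that produced the 5 files; originalDf and outputPath are placeholders for that job's DataFrame and destination -

// Reduce the number of output partitions so none of them ends up as an
// empty part file; 1 is the extreme case, pick whatever fits the data volume.
val outputPath = "/tmp/parquet-out"   // placeholder
originalDf
  .coalesce(1)
  .write
  .mode("overwrite")
  .parquet(outputPath)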

