Spark Exception "Complex types not supported" while loading Parquet


Problem description

I am trying to load a Parquet file in Spark as a DataFrame -

val df = spark.read.parquet(path)

and I am getting -

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 5.0 failed 4 times, most recent failure: Lost task 0.3 in stage 5.0 (TID 12, 10.250.2.32): java.lang.UnsupportedOperationException: Complex types not supported.

While going through the code, I realized there is a check in Spark's VectorizedParquetRecordReader.java (initializeInternal) -

Type t = requestedSchema.getFields().get(i);
if (!t.isPrimitive() || t.isRepetition(Type.Repetition.REPEATED)) {
  throw new UnsupportedOperationException("Complex types not supported.");
}

So I think it is failing on the isRepetition check. Can anybody suggest a way to solve the issue?

My Parquet data is like -

Key1 = value1
Key2 = value1
Key3 = value1
Key4:
.list:
..element:
...key5:
....list:
.....element:
......certificateSerialNumber = dfsdfdsf45345
......issuerName = CN=Microsoft Windows Verification PCA, O=Microsoft Corporation, L=Redmond, S=Washington, C=US
......subjectName = CN=Microsoft Windows, OU=MOPR, O=Microsoft Corporation, L=Redmond, S=Washington, C=US
......thumbprintAlgorithm = Sha1
......thumbprintContent = sfdasf42dsfsdfsdfsd
......validFrom = 2009-12-07 21:57:44.000000
......validTo = 2011-03-07 21:57:44.000000
....list:
.....element:
......certificateSerialNumber = dsafdsafsdf435345
......issuerName = CN=Microsoft Root Certificate Authority, DC=microsoft, DC=com
......subjectName = CN=Microsoft Windows Verification PCA, O=Microsoft Corporation, L=Redmond, S=Washington, C=US
......thumbprintAlgorithm = Sha1
......thumbprintContent = sdfsdfdsf43543
......validFrom = 2005-09-15 21:55:41.000000
......validTo = 2016-03-15 22:05:41.000000

And I suspect key4 may be raising the issue because of the nested tree. The input data is JSON, so maybe Parquet does not understand that kind of nesting.
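
A quick way to check which columns Spark actually treats as complex is to print the inferred schema; schema inference only reads the Parquet footers, so it typically works even when the full scan throws the exception above. A minimal sketch, reusing the same path as in the question -

// Columns reported as array<...> or struct<...> are the "complex types"
// that the vectorized reader rejects in this Spark version.
val df = spark.read.parquet(path)
df.printSchema()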

I found a bug report, https://issues.apache.org/jira/browse/HIVE-13744, but it describes a Hive complex type issue. I am not sure whether it will fix the issue with Parquet or not.

Update 1
Exploring the Parquet files further, I concluded the following -

I have 5 Parquet files created by spark.write. Two of those files are empty, so the schema for a column that was supposed to be ArrayType comes out as StringType, and when I try to read the directory as a whole I see the above exception.
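
One way to confirm that diagnosis is to read each part file on its own and compare the inferred schemas, so the mismatching (empty) files stand out. A minimal sketch; listing the directory through the Hadoop FileSystem API is an assumption about how the files are laid out -

import org.apache.hadoop.fs.{FileSystem, Path}

// Print the schema of every part file under the output directory.
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val partFiles = fs.listStatus(new Path(path))
  .map(_.getPath.toString)
  .filter(_.endsWith(".parquet"))

partFiles.foreach { file =>
  println(s"--- $file")
  spark.read.parquet(file).printSchema()
}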

Answer

Take 1
SPARK-12854 Vectorize Parquet reader indicates that "ColumnarBatch supports structs and arrays" (cf. GitHub pull request 10820), starting with Spark 2.0.0.

And SPARK-13518 Enable vectorized parquet reader by default, also starting with Spark 2.0.0, deals with the property spark.sql.parquet.enableVectorizedReader (cf. GitHub commit e809074).

My 2 cents: disable that "VectorizedReader" optimization and see what happens.
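
A minimal sketch of that experiment; the property name comes from SPARK-13518 above, and path is the same input path as in the question -

// Fall back to the non-vectorized (row-based) Parquet reader,
// which does not have the "Complex types not supported" restriction.
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")

val df = spark.read.parquet(path)
df.show(5)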

Take 2
Since the problem has been narrowed down to some empty files that do not carry the same schema as the "real" files, my 3 cents: experiment with spark.sql.parquet.mergeSchema to see whether the schema from the real files takes precedence after merging.
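
A sketch of that experiment on the read side; whether the merged schema actually resolves the StringType-vs-ArrayType conflict coming from the empty files is exactly what it should tell you -

// Merge the footers of all part files instead of trusting a single file.
// Equivalent global setting: spark.conf.set("spark.sql.parquet.mergeSchema", "true")
val merged = spark.read
  .option("mergeSchema", "true")
  .parquet(path)

merged.printSchema()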

Other than that, you might try to eliminate the empty files at write time with some kind of repartitioning, e.g. coalesce(1) (OK, 1 is a bit caricatural, but you see the point).
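
A sketch of the write-side fix, assuming you control the job that produced the 5 files; originalDf and outputPath are placeholders for that job's DataFrame and destination -

// Reduce the number of output partitions so none of them ends up as an
// empty part file; 1 is the extreme case, pick whatever fits the data volume.
val outputPath = "/tmp/parquet-out"   // placeholder
originalDf
  .coalesce(1)
  .write
  .mode("overwrite")
  .parquet(outputPath)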

