Spark issues reading parquet files


Problem description

I have two parquet part files, part-00043-0bfd7e28-6469-4849-8692-e625c25485e2-c000.snappy.parquet (from the 2017 Nov 14th run) and part-00199-64714828-8a9e-4ae1-8735-c5102c0a834d-c000.snappy.parquet (from the 2017 Nov 16th run), and both have the same schema (which I verified by printing the schema).

My problem is that I have, say, 10 columns, and all of them come through properly if I read these 2 files separately using Spark. But if I put both files in one folder and try to read them together, the total count is correct (the sum of rows from the 2 files), yet most of the columns from the 2nd file come back null. Only some 2 or 3 columns have proper values (the values are present in the file, since they show up properly when I read it alone). What am I missing here? Here is the code I used for testing:

import org.apache.spark.sql.SparkSession

def initSparkConfig: SparkSession = {

    val sparkSession: SparkSession = SparkSession
      .builder()
      .appName("test")
      .master("local")
      .getOrCreate()

    // Parquet- and Hive-related settings used for the test run
    sparkSession.sparkContext.hadoopConfiguration.set("mapreduce.fileoutputcommitter.algorithm.version", "2")
    sparkSession.sparkContext.getConf.set("spark.hadoop.parquet.enable.summary-metadata", "false")
    sparkSession.sparkContext.getConf.set("spark.sql.parquet.mergeSchema", "false")
    sparkSession.sparkContext.getConf.set("spark.sql.parquet.filterPushdown", "false")
    sparkSession.sparkContext.getConf.set("spark.sql.hive.metastorePartitionPruning", "true")

    sparkSession
  }

val sparkSession = initSparkConfig

// Read the folder that contains both part files
sparkSession.read.parquet("/test_spark/").createOrReplaceTempView("table")
sparkSession.sql("select * from table").show

Update:

If I read both files separately, do a union, and then read the result, all columns get populated without any issues.
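
A minimal sketch of that union workaround, assuming the two part files are read from separate locations (the paths below are placeholders for illustration, not the original ones):

// Read each run's part file on its own, then union the two DataFrames.
// The paths here are hypothetical placeholders.
val df1 = sparkSession.read.parquet("/test_spark/run_2017_11_14/")
val df2 = sparkSession.read.parquet("/test_spark/run_2017_11_16/")

// union matches columns by position; unionByName (Spark 2.3+) matches them by name
val combined = df1.union(df2)
combined.show()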

Update 2:

If I set mergeSchema = true while reading, it throws an exception: Found duplicate column(s) in the data schema and the partition schema: [list of the columns that are coming back null]. And one of the filter columns is reported as ambiguous.
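
For reference, a sketch of how schema merging can be enabled for a single read via the standard DataFrameReader option (same test folder as above); with the mismatched column names, this is the read that raised the duplicate-column exception:

// Enable Parquet schema merging for this read only
val merged = sparkSession.read
  .option("mergeSchema", "true")
  .parquet("/test_spark/")
merged.printSchema()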

Answer

It turns out that the schemas were not an exact match. The column names that were coming back as null differed in case (some character in between). Parquet column names are case sensitive, so this was causing all the issues: Spark was trying to read columns that were not there at all.
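
A small sketch, with placeholder paths, of how such a case-only mismatch can be surfaced by comparing the two file schemas field by field:

// Read only the schemas of the two files (paths are hypothetical placeholders)
val schemaA = sparkSession.read.parquet("/test_spark/nov14_file.parquet").schema
val schemaB = sparkSession.read.parquet("/test_spark/nov16_file.parquet").schema

val namesA = schemaA.fieldNames.toSet
val namesB = schemaB.fieldNames.toSet

// String comparison is case sensitive, so a casing mismatch shows up here
println(namesA.diff(namesB))
println(namesB.diff(namesA))

// If this prints true, the schemas differ only in column-name casing
println(namesA.map(_.toLowerCase) == namesB.map(_.toLowerCase))

One common way to avoid this kind of problem is to normalize column names to a single casing (for example, lower-casing every column) before writing, so that every run produces an identical schema.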

