Spark issues reading parquet files


Problem description

I have two parquet part files, part-00043-0bfd7e28-6469-4849-8692-e625c25485e2-c000.snappy.parquet (from the 2017 Nov 14th run) and part-00199-64714828-8a9e-4ae1-8735-c5102c0a834d-c000.snappy.parquet (from the 2017 Nov 16th run), and both have the same schema (which I verified by printing the schema).

My problem is that I have, say, 10 columns, and all of them come through properly if I read these 2 files separately using Spark. But if I put both files in one folder and try to read them together, the total count is correct (the sum of rows from the 2 files), yet most of the columns from the 2nd file come back null. Only some 2 or 3 columns have proper values (the values are present in the file, since they show up properly when I read it alone). What am I missing here? Here is the code I used for testing:

import org.apache.spark.sql.SparkSession

def initSparkConfig: SparkSession = {

    val sparkSession: SparkSession = SparkSession
      .builder()
      .appName("test")
      .master("local")
      .getOrCreate()

    // Parquet- and Hive-related settings used for the test run
    sparkSession.sparkContext.hadoopConfiguration.set("mapreduce.fileoutputcommitter.algorithm.version", "2")
    sparkSession.sparkContext.getConf.set("spark.hadoop.parquet.enable.summary-metadata", "false")
    sparkSession.sparkContext.getConf.set("spark.sql.parquet.mergeSchema", "false")
    sparkSession.sparkContext.getConf.set("spark.sql.parquet.filterPushdown", "false")
    sparkSession.sparkContext.getConf.set("spark.sql.hive.metastorePartitionPruning", "true")

    sparkSession
  }

val sparkSession = initSparkConfig

// Read the folder that contains both part files
sparkSession.read.parquet("/test_spark/").createOrReplaceTempView("table")
sparkSession.sql("select * from table").show

Update:

If I read both files separately, do a union, and then read the result, all columns get populated without any issues.
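
A minimal sketch of that union workaround, assuming the two part files are read from separate locations (the paths below are placeholders for illustration, not the original ones):

// Read each run's part file on its own, then union the two DataFrames.
// The paths here are hypothetical placeholders.
val df1 = sparkSession.read.parquet("/test_spark/run_2017_11_14/")
val df2 = sparkSession.read.parquet("/test_spark/run_2017_11_16/")

// union matches columns by position; unionByName (Spark 2.3+) matches them by name
val combined = df1.union(df2)
combined.show()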

Update 2:

If I set mergeSchema = true while reading, it throws an exception: Found duplicate column(s) in the data schema and the partition schema: [list of the columns that are coming back null]. And one of the filter columns is reported as ambiguous.
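
For reference, a sketch of how schema merging can be enabled for a single read via the standard DataFrameReader option (same test folder as above); with the mismatched column names, this is the read that raised the duplicate-column exception:

// Enable Parquet schema merging for this read only
val merged = sparkSession.read
  .option("mergeSchema", "true")
  .parquet("/test_spark/")
merged.printSchema()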

Answer

It turns out that the schemas were not an exact match. The column names that were coming back as null differed in case (some character in between). Parquet column names are case sensitive, so this was causing all the issues: Spark was trying to read columns that were not there at all.
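
A small sketch, with placeholder paths, of how such a case-only mismatch can be surfaced by comparing the two file schemas field by field:

// Read only the schemas of the two files (paths are hypothetical placeholders)
val schemaA = sparkSession.read.parquet("/test_spark/nov14_file.parquet").schema
val schemaB = sparkSession.read.parquet("/test_spark/nov16_file.parquet").schema

val namesA = schemaA.fieldNames.toSet
val namesB = schemaB.fieldNames.toSet

// String comparison is case sensitive, so a casing mismatch shows up here
println(namesA.diff(namesB))
println(namesB.diff(namesA))

// If this prints true, the schemas differ only in column-name casing
println(namesA.map(_.toLowerCase) == namesB.map(_.toLowerCase))

One common way to avoid this kind of problem is to normalize column names to a single casing (for example, lower-casing every column) before writing, so that every run produces an identical schema.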

