如何在 Spark 2.1 中保存分区的镶木地板文件? [英] How to save a partitioned parquet file in Spark 2.1?
问题描述
我正在尝试使用 Spark 2.1 测试如何在 HDFS 2.7 中写入数据.我的数据是一个简单的虚拟值序列,输出应按属性进行分区:id 和 key.
I am trying to test how to write data in HDFS 2.7 using Spark 2.1. My data is a simple sequence of dummy values and the output should be partitioned by the attributes: id and key.
// Simple case class to cast the data
case class SimpleTest(id:String, value1:Int, value2:Float, key:Int)
// Actual data to be stored
val testData = Seq(
SimpleTest("test", 12, 13.5.toFloat, 1),
SimpleTest("test", 12, 13.5.toFloat, 2),
SimpleTest("test", 12, 13.5.toFloat, 3),
SimpleTest("simple", 12, 13.5.toFloat, 1),
SimpleTest("simple", 12, 13.5.toFloat, 2),
SimpleTest("simple", 12, 13.5.toFloat, 3)
)
// Spark's workflow to distribute, partition and store
// sc and sql are the SparkContext and SparkSession, respectively
val testDataP = sc.parallelize(testData, 6)
val testDf = sql.createDataFrame(testDataP).toDF("id", "value1", "value2", "key")
testDf.write.partitionBy("id", "key").parquet("/path/to/file")
我希望在 HDFS 中获得以下树结构:
I am expecting to get the following tree structure in HDFS:
- /path/to/file
|- /id=test/key=1/part-01.parquet
|- /id=test/key=2/part-02.parquet
|- /id=test/key=3/part-03.parquet
|- /id=simple/key=1/part-04.parquet
|- /id=simple/key=2/part-05.parquet
|- /id=simple/key=3/part-06.parquet
但是当我运行前面的代码时,我得到以下输出:
But when I run the previous code I get the following output:
/path/to/file/id=/key=24/
|-/part-01.parquet
|-/part-02.parquet
|-/part-03.parquet
|-/part-04.parquet
|-/part-05.parquet
|-/part-06.parquet
我不知道是代码有问题,还是 Spark 正在做的其他事情.
I do not know if there is something wrong in the code, or is there something else that Spark is doing.
我正在执行 spark-submit
如下:
spark-submit --name APP --master local --driver-memory 30G --executor-memory 30G --executor-cores 8 --num-executors 8 --conf spark.io.compression.codec=lzf--conf spark.akka.frameSize=1024 --conf spark.driver.maxResultSize=1g --conf spark.sql.orc.compression.codec=未压缩 --conf spark.sql.parquet.filterPushdown=true --class myClassmyFatJar.jar
spark-submit --name APP --master local --driver-memory 30G --executor-memory 30G --executor-cores 8 --num-executors 8 --conf spark.io.compression.codec=lzf --conf spark.akka.frameSize=1024 --conf spark.driver.maxResultSize=1g --conf spark.sql.orc.compression.codec=uncompressed --conf spark.sql.parquet.filterPushdown=true --class myClass myFatJar.jar
推荐答案
我找到了解决方案!根据 Cloudera,是 mapred-site.xml 配置问题(检查下面的链接).另外,不要将数据帧写为: testDf.write.partitionBy("id", "key").parquet("/path/to/file")
I found a solution! According to Cloudera, is a mapred-site.xml configuration problem (check link below). Also, instead of writing the dataframe as: testDf.write.partitionBy("id", "key").parquet("/path/to/file")
我是这样做的:testDf.write.partitionBy("id", "key").parquet("hdfs://
.您可以分别用 HDFS 的主节点名称和端口替换
和
.
I did it as follows: testDf.write.partitionBy("id", "key").parquet("hdfs://<namenode>:<port>/path/to/file")
. You can substitute <namenode>
and <port>
with the HDFS' masternode name and port, respectively.
特别感谢@jacek-laskowski 的宝贵贡献.
Special thanks to @jacek-laskowski, for his valuable contribution.
参考文献:
https://community.cloudera.com/t5/Batch-SQL-Apache-Hive/MKDirs-failed-to-create-file/mp/36363#M1090
这篇关于如何在 Spark 2.1 中保存分区的镶木地板文件?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!