How to write a table to Hive from Spark without using the warehouse connector in HDP 3.1
Problem description
When trying to use Spark 2.3 on HDP 3.1 to write to a Hive table directly into Hive's schema, without the warehouse connector, using:
spark-shell --driver-memory 16g --master local[3] --conf spark.hadoop.metastore.catalog.default=hive
val df = Seq(1,2,3,4).toDF
spark.sql("create database foo")
df.write.saveAsTable("foo.my_table_01")
fails with:
Table foo.my_table_01 failed strict managed table checks due to the following reason: Table is marked as a managed table but is not transactional
However, the following:
val df = Seq(1,2,3,4).toDF.withColumn("part", col("value"))
df.write.partitionBy("part").option("compression", "zlib").mode(SaveMode.Overwrite).format("orc").saveAsTable("foo.my_table_02")
followed by
spark.sql("select * from foo.my_table_02").show
works just fine in Spark.
Now going to Hive / beeline:
0: jdbc:hive2://hostname:2181/> select * from my_table_02;
Error: java.io.IOException: java.lang.IllegalArgumentException: bucketId out of range: -1 (state=,code=0)
A
describe extended my_table_02;
returns
+-----------------------------+----------------------------------------------------+----------+
| col_name | data_type | comment |
+-----------------------------+----------------------------------------------------+----------+
| value | int | |
| part | int | |
| | NULL | NULL |
| # Partition Information | NULL | NULL |
| # col_name | data_type | comment |
| part | int | |
| | NULL | NULL |
| Detailed Table Information | Table(tableName:my_table_02, dbName:foo, owner:hive/bd-sandbox.t-mobile.at@SANDBOX.MAGENTA.COM, createTime:1571201905, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:value, type:int, comment:null), FieldSchema(name:part, type:int, comment:null)], location:hdfs://bd-sandbox.t-mobile.at:8020/warehouse/tablespace/external/hive/foo.db/my_table_02, inputFormat:org.apache.hadoop.hive.ql.io.orc.OrcInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.ql.io.orc.OrcSerde, parameters:{path=hdfs://bd-sandbox.t-mobile.at:8020/warehouse/tablespace/external/hive/foo.db/my_table_02, compression=zlib, serialization.format=1}), bucketCols:[], sortCols:[], parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], skewedColValueLocationMaps:{}), storedAsSubDirectories:false), partitionKeys:[FieldSchema(name:part, type:int, comment:null)], parameters:{numRows=0, rawDataSize=0, spark.sql.sources.schema.partCol.0=part, transient_lastDdlTime=1571201906, bucketing_version=2, spark.sql.create.version=2.3.2.3.1.0.0-78, totalSize=740, spark.sql.sources.schema.numPartCols=1, spark.sql.sources.schema.part.0={\"type\":\"struct\",\"fields\":[{\"name\":\"value\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{}},{\"name\":\"part\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{}}]}, numFiles=4, numPartitions=4, spark.sql.partitionProvider=catalog, spark.sql.sources.schema.numParts=1, spark.sql.sources.provider=orc, transactional=true}, viewOriginalText:null, viewExpandedText:null, tableType:MANAGED_TABLE, rewriteEnabled:false, catName:hive, ownerType:USER, writeId:-1) |
How can I use Spark to write to Hive without using the warehouse connector, but still writing to the same metastore which can later on be read by Hive?
To the best of my knowledge, external tables should be possible (they are not managed, not ACID, not transactional), but I am not sure how to tell saveAsTable how to handle these.
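One idea, sketched below under the assumption that supplying an explicit path makes saveAsTable register an EXTERNAL, non-transactional table in the metastore (I have not verified this behaviour on HDP 3.1, and the target location is hypothetical):
import org.apache.spark.sql.SaveMode

// hypothetical target location under the external warehouse directory
val externalPath = "/warehouse/tablespace/external/hive/foo.db/my_table_03"

df.write
  .partitionBy("part")
  .option("path", externalPath)   // assumption: an explicit path => external table
  .option("compression", "zlib")
  .mode(SaveMode.Overwrite)
  .format("orc")
  .saveAsTable("foo.my_table_03")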
Related questions:
- Table loaded through Spark not accessible in Hive
- setting the properties proposed in the answer there does not solve my issue
- https://community.cloudera.com/t5/Support-Questions/In-hdp-3-0-can-t-create-hive-table-in-spark-failed/td-p/202647
A workaround like https://github.com/qubole/spark-acid (similar to https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.1.4/integrating-hive/content/hive_hivewarehouseconnector_for_handling_apache_spark_data.html) might be an option, but I do not like the idea of using more duct tape where I have not yet seen any large-scale performance tests. Also, this means changing all existing Spark jobs.
In fact, "Can't save table to hive metastore, HDP 3.0" reports issues with large data frames and the warehouse connector.
I just found the following:
execute() vs executeQuery()

executeQuery() will always use HiveServer2-interactive/LLAP, as it uses the fast ARROW protocol. Using it when the JDBC URL points to the non-LLAP HiveServer2 will yield an error.

execute() uses JDBC and does not have this dependency on LLAP, but has a built-in restriction to return at most 1,000 records. But for most queries (INSERT INTO ... SELECT, count, sum, average) that is not a problem.
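To make the difference concrete, a minimal sketch of the two calls (assuming the HWC jar and the usual spark.datasource.hive.warehouse.* settings are already configured for the session):
import com.hortonworks.hwc.HiveWarehouseSession

val hive = HiveWarehouseSession.session(spark).build()

// routed through HiveServer2-interactive / LLAP via the ARROW protocol
val viaLlap = hive.executeQuery("SELECT * FROM foo.my_table_02")

// routed through plain JDBC; no LLAP needed, but capped at roughly 1,000 returned rows
val viaJdbc = hive.execute("SELECT count(*) FROM foo.my_table_02")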
But doesn't this kill any high-performance interoperability between Hive and Spark? Especially if there are not enough LLAP nodes available for large-scale ETL.
In fact, this is true. This setting can be configured at https://github.com/hortonworks-spark/spark-llap/blob/26d164e62b45cfa1420d5d43cdef13d1d29bb877/src/main/java/com/hortonworks/spark/sql/hive/llap/HWConf.java#L39, though I am not sure of the performance impact of increasing this value
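If I read HWConf.java correctly, that limit is backed by a Spark configuration key roughly like the one below; the exact key name is my reading of the source and not something I have verified:
spark-shell --conf spark.datasource.hive.warehouse.exec.results.max=100000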
Recommended answer
Creating an external table (as a workaround) seems to be the best option for me. This still involves HWC to register the column metadata or update the partition information.
Something along these lines:
val df: DataFrame = ...
val externalPath = "/warehouse/tablespace/external/hive/my_db.db/my_table"

import com.hortonworks.hwc.HiveWarehouseSession
val hive = HiveWarehouseSession.session(spark).build()

// write the data as plain ORC files to the external location
df.write.partitionBy("part_col").option("compression", "zlib").mode(SaveMode.Overwrite).orc(externalPath)

// build the column list for the DDL from the dataframe schema (minus the partition column)
val columns = df.drop("part_col").schema.fields.map(field => s"${field.name} ${field.dataType.simpleString}").mkString(", ")

val ddl = s"""
  |CREATE EXTERNAL TABLE my_db.my_table ($columns)
  |PARTITIONED BY (part_col string)
  |STORED AS ORC
  |LOCATION '$externalPath'
  """.stripMargin

// register the table and its partitions through HWC
hive.execute(ddl)
hive.execute("MSCK REPAIR TABLE my_db.my_table SYNC PARTITIONS")
Unfortunately, this throws:
java.sql.SQLException: The query did not generate a result set!
from the Hive warehouse connector.
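If I read the HWC API correctly, executeUpdate() is intended for statements that do not return a result set, so it might avoid this error; an untested sketch:
// assumption: executeUpdate() accepts DDL without expecting a result set
hive.executeUpdate(ddl)
hive.executeUpdate("MSCK REPAIR TABLE my_db.my_table SYNC PARTITIONS")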