save Spark dataframe to Hive: table not readable because "parquet not a SequenceFile"
Problem description
I'd like to save data in a Spark (v 1.3.0) dataframe to a Hive table using PySpark. The documentation states:
"spark.sql.hive.convertMetastoreParquet: When set to false, Spark SQL will use the Hive SerDe for parquet tables instead of the built in support."
Looking at the Spark tutorial, it seems that this property can be set:
from pyspark.sql import HiveContext
sqlContext = HiveContext(sc)
sqlContext.sql("SET spark.sql.hive.convertMetastoreParquet=false")
# code to create dataframe
my_dataframe.saveAsTable("my_dataframe")
However, when I try to query the saved table in Hive it returns:
hive> select * from my_dataframe;
OK
Failed with exception java.io.IOException:java.io.IOException:
hdfs://hadoop01.woolford.io:8020/user/hive/warehouse/my_dataframe/part-r-00001.parquet
not a SequenceFile
How do I save the table so that it's immediately readable in Hive?
I've been there... If you wish to create a Hive table from Spark, you can use this approach:
The API is kinda misleading on this one. DataFrame.saveAsTable does not create a Hive table, but an internal Spark table source. It also stores something into the Hive metastore, but not what you intend.
This remark was made on the spark-user mailing list regarding Spark 1.3.
1. Use CREATE TABLE ... via SparkSQL for the Hive metastore.
2. Use DataFrame.insertInto(tableName, overwriteMode) for the actual data (Spark 1.3).
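The two steps above can be sketched as follows. This is a minimal sketch, assuming Hive 0.13+ (where STORED AS PARQUET is available); the helper name build_parquet_ddl and the example columns are hypothetical, so adapt them to your dataframe's actual schema:

```python
def build_parquet_ddl(table_name, columns):
    """Build a CREATE TABLE statement that registers a Parquet-backed
    Hive table, so Hive reads the files with the Parquet SerDe instead
    of assuming a SequenceFile. `columns` is a list of (name, hive_type)
    pairs matching the dataframe's schema."""
    cols = ", ".join("%s %s" % (name, dtype) for name, dtype in columns)
    return ("CREATE TABLE IF NOT EXISTS %s (%s) STORED AS PARQUET"
            % (table_name, cols))

ddl = build_parquet_ddl("my_dataframe", [("id", "INT"), ("name", "STRING")])
print(ddl)
# CREATE TABLE IF NOT EXISTS my_dataframe (id INT, name STRING) STORED AS PARQUET

# On a live cluster, the two steps would then look like (not run here,
# since they need a SparkContext and a Hive metastore):
#
#   from pyspark.sql import HiveContext
#   sqlContext = HiveContext(sc)
#   sqlContext.sql(ddl)                            # step 1: table in the metastore
#   my_dataframe.insertInto("my_dataframe", False) # step 2: write the rows
```

Because the table is created through the metastore with an explicit Parquet storage format, Hive picks the correct input format and the "not a SequenceFile" error goes away.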