Schema error after altering hive table with pyspark

Question
I have a table in hive called test with columns id and name.
Now I have another table in hive called mysql, with columns id, name and city.
Now I want to compare the schemas of both tables and add the column difference to the hive table test.
>>> hive_df = sqlContext.table("testing.test")
>>> mysql_df = sqlContext.table("testing.mysql")
>>> hive_df.dtypes
[('id', 'int'), ('name', 'string')]
>>> mysql_df.dtypes
[('id', 'int'), ('name', 'string'), ('city', 'string')]
>>> hive_dtypes = hive_df.dtypes
>>> hive_dtypes
[('id', 'int'), ('name', 'string')]
>>> mysql_dtypes = mysql_df.dtypes
>>> diff = set(mysql_dtypes) ^ set(hive_dtypes)
>>> diff
set([('city', 'string')])
>>> for col_name, col_type in diff:
...     sqlContext.sql("ALTER TABLE testing.test ADD COLUMNS ({0} {1})".format(col_name, col_type))
...
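One caveat with the loop above: `^` is a symmetric difference, so it would also pick up any column that exists only in test and try to ADD it back. A plain set difference keeps only the columns missing from the target. A minimal sketch, using the dtypes lists shown in the question:

```python
# dtypes as reported in the question
hive_dtypes = [('id', 'int'), ('name', 'string')]
mysql_dtypes = [('id', 'int'), ('name', 'string'), ('city', 'string')]

# set difference (mysql minus hive) keeps only the columns that are
# missing from testing.test, never columns unique to testing.test
missing = set(mysql_dtypes) - set(hive_dtypes)

# build one ALTER statement per missing column, sorted for stable order
statements = ["ALTER TABLE testing.test ADD COLUMNS ({0} {1})".format(c, t)
              for c, t in sorted(missing)]
```

Each statement can then be passed to sqlContext.sql as in the original loop.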
After doing all this, the hive table test has the new column city added with null values, as expected.
Now, when I close the spark session, open a new spark session, and do

hive_df = sqlContext.table("testing.test")

then hive_df, I should get

DataFrame[id: int, name: string, city: string]

but instead I get

DataFrame[id: int, name: string]
When I do a desc of the hive table test:

hive> desc test;
OK
id      int
name    string
city    string
Why is the schema change not reflected in the Pyspark dataframe after we alter the corresponding hive table?
FYI, I am using spark 1.6.
Answer

It looks like there is a Jira for this issue, https://issues.apache.org/jira/browse/SPARK-9764, which has been fixed in Spark 2.0.
For those using spark 1.6, try creating the table through sqlContext instead: first register the dataframe as a temp table, then do

sqlContext.sql("create table table as select * from temptable")
This way, after you alter the hive table and recreate the spark data frame, the df will have the newly added columns as well.
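The workaround can be sketched as a small helper (Spark 1.6 API; the function name, the temp-table name "temptable", and the target table are illustrative assumptions, not part of the original answer):

```python
def recreate_table_via_sqlcontext(sqlContext, df, target_table, temp_name="temptable"):
    # Register the dataframe under a temporary name (Spark 1.6 API;
    # in Spark 2.x this became createOrReplaceTempView).
    df.registerTempTable(temp_name)
    # Build the CTAS statement and run it through sqlContext, so the
    # table is created by Spark itself rather than directly in Hive.
    ctas = "create table {0} as select * from {1}".format(target_table, temp_name)
    sqlContext.sql(ctas)
    return ctas
```

Called as recreate_table_via_sqlcontext(sqlContext, mysql_df, "testing.test"), it issues a single CREATE TABLE ... AS SELECT through the same sqlContext that will later read the table.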
This issue was resolved with the help of @zero323.