Schema error after altering hive table with pyspark
Problem description
I have a table in hive called test with columns id and name.
Now I have another table in hive called mysql with columns id, name and city.
Now I want to compare the schemas of both tables and add the column difference to the hive table test:
>>> hive_df = sqlContext.table("testing.test")
>>> mysql_df = sqlContext.table("testing.mysql")
>>> hive_df.dtypes
[('id', 'int'), ('name', 'string')]
>>> mysql_df.dtypes
[('id', 'int'), ('name', 'string'), ('city', 'string')]
>>> hive_dtypes = hive_df.dtypes
>>> mysql_dtypes = mysql_df.dtypes
>>> diff = set(mysql_dtypes) ^ set(hive_dtypes)
>>> diff
set([('city', 'string')])
>>> for col_name, col_type in diff:
...     sqlContext.sql("ALTER TABLE testing.test ADD COLUMNS ({0} {1})".format(col_name, col_type))
...
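One caveat worth noting about the snippet above: `^` is a *symmetric* set difference, so a column that existed only in testing.test (and not in testing.mysql) would also land in diff, and the ALTER would then try to add a column that is already present. The statement-building part needs no Spark session, so it can be sketched in plain Python with a one-directional difference instead (the dtype lists below are copied from the session above):

```python
# Dtype lists as returned by df.dtypes in the session above.
hive_dtypes = [('id', 'int'), ('name', 'string')]
mysql_dtypes = [('id', 'int'), ('name', 'string'), ('city', 'string')]

# One-directional difference: columns present in mysql but missing
# from test. Unlike ^, this cannot pick up columns that exist only
# in the target table.
missing = set(mysql_dtypes) - set(hive_dtypes)

# Build the ALTER statements; sorted() just makes the order stable.
statements = [
    "ALTER TABLE testing.test ADD COLUMNS ({0} {1})".format(col_name, col_type)
    for col_name, col_type in sorted(missing)
]
print(statements)
```

Each string in `statements` would then be passed to `sqlContext.sql(...)` exactly as in the loop above.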
After doing all this, the hive table test has the new column city added with null values, as expected.
Now when I close the spark session, open a new one, and do

hive_df = sqlContext.table("testing.test")

then hive_df should give me

DataFrame[id: int, name: string, city: string]

but instead I get

DataFrame[id: int, name: string]
Yet when I describe the table test in the hive shell:
hive> desc test;
OK
id int
name string
city string
Why is the schema change not reflected in the Pyspark dataframe after we alter the corresponding hive table?
FYI, I am using Spark 1.6.
Recommended answer
Looks like there is a Jira for this issue, https://issues.apache.org/jira/browse/SPARK-9764, which has been fixed in Spark 2.0.
For those using Spark 1.6, try creating the table with sqlContext.
First register the dataframe as a temp table, then run

sqlContext.sql("create table table as select * from temptable")

This way, after you alter the hive table and recreate the spark dataframe, the df will have the newly added columns as well.
This issue was resolved with the help of @zero323.