Schema error after altering hive table with pyspark


Question

I have a table in Hive called test, with columns id and name.

Now I have another table in Hive called mysql, with columns id, name and city.

Now I want to compare the schemas of the two tables and add the column difference to the Hive table test.

hive_df= sqlContext.table("testing.test")

mysql_df= sqlContext.table("testing.mysql")

hive_df.dtypes

[('id', 'int'), ('name', 'string')]

mysql_df.dtypes

[('id', 'int'), ('name', 'string'), ('city', 'string')]

hive_dtypes=hive_df.dtypes

hive_dtypes

[('id', 'int'), ('name', 'string')]


mysql_dtypes= mysql_df.dtypes

diff = set(mysql_dtypes) ^ set(hive_dtypes)

diff

set([('city', 'string')])

for col_name, col_type in diff:
...  sqlContext.sql("ALTER TABLE testing.test ADD COLUMNS ({0} {1})".format(col_name, col_type))
...

After doing all this, the Hive table test has the new column city, populated with null values as expected.

Now, when I close the Spark session, open a new Spark session, and do

hive_df= sqlContext.table("testing.test")

and then

hive_df

I should get

DataFrame[id: int, name: string, city: string]

but instead I get

DataFrame[id: int, name: string]

But when I describe the hive table test:

hive> desc test;
OK
id                      int
name                    string
city                    string

Why is the schema change not reflected in the PySpark DataFrame after we alter the corresponding Hive table?

FYI, I am using Spark 1.6.

Answer

There is a Jira for this issue, https://issues.apache.org/jira/browse/SPARK-9764, which was fixed in Spark 2.0.

For those using Spark 1.6, try creating the table with sqlContext instead. (A plausible explanation: tables created through the DataFrame writer keep a copy of their schema in the table properties, and an ALTER TABLE issued in Hive does not update that copy, so Spark keeps reading the stale schema.)

First register the DataFrame as a temporary table, then execute

sqlContext.sql("create table table as select * from temptable")

This way, after you alter the Hive table and recreate the Spark DataFrame, the DataFrame will have the newly added columns as well.
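
For example, re-running the read from the question in a fresh session, against a table created this way, should now reflect the ALTER:

hive_df = sqlContext.table("testing.test")
hive_df
DataFrame[id: int, name: string, city: string]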

This issue was resolved with the help of @zero323.
