How to make a table that is automatically updated in Hive


Problem Description

I have created an external table in Hive that uses data from a Parquet store in HDFS.

When the data in HDFS is deleted, there is no data in the table. When the data is inserted again in the same spot in HDFS, the table does not get updated to contain the new data. If I insert new records into the existing table that contains data, no new data is shown when I run my Hive queries.

How I create the table in Hive:

CREATE EXTERNAL TABLE nodes (id string) STORED AS PARQUET LOCATION "/hdfs/nodes";

The relevant error:

Error: java.io.FileNotFoundException: File does not exist: /hdfs/nodes/part-r-00038-2149d17d-f890-48bc-a9dd-5ea07b0ec590.gz.parquet

I have seen several posts explaining that external tables should have the most up-to-date data in them, such as here. However, this is not the case for me, and I don't know what is happening.

I inserted the same data into the database again, and queried the table. It contained the same amount of data as before. I then created an identical table with a different name. It had twice as much data in it, which was the right amount.

The issue might be with the metastore database. I am using PostgreSQL instead of Derby for the metastore database.

Relevant information:

  • Hive 0.13.0
  • Spark Streaming 1.4.1
  • PostgreSQL 9.3
  • CentOS 7

EDIT: After examining the Parquet files, I found that the part files have seemingly incompatible file names.

-rw-r--r--   3 hdfs hdfs   18702811 2015-08-27 08:22 /hdfs/nodes/part-r-00000-1670f7a9-9d7c-4206-84b5-e812d1d8fd9a.gz.parquet
-rw-r--r--   3 hdfs hdfs   18703029 2015-08-26 15:43 /hdfs/nodes/part-r-00000-7251c663-f76e-4903-8c5d-e0c6f61e0192.gz.parquet
-rw-r--r--   3 hdfs hdfs   18724320 2015-08-27 08:22 /hdfs/nodes/part-r-00001-1670f7a9-9d7c-4206-84b5-e812d1d8fd9a.gz.parquet
-rw-r--r--   3 hdfs hdfs   18723575 2015-08-26 15:43 /hdfs/nodes/part-r-00001-7251c663-f76e-4903-8c5d-e0c6f61e0192.gz.parquet

These are the files that cause the Hive error above when Hive can't find them. This means that the external table is not acting dynamically, accepting any file in the directory (if you can call it that in HDFS), but instead is probably just keeping track of the list of parquet files inside the directory when it was created.

Sample Spark code:

nodes.foreachRDD(rdd => {
  if (!rdd.isEmpty())
    sqlContext.createDataFrame(rdd.map(n => Row(n.stuff)), ParquetStore.nodeSchema)
      .write.mode(SaveMode.Append).parquet(node_name)
})

Where nodeSchema is the schema and node_name is "/hdfs/nodes".
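
For reference, a minimal sketch of what ParquetStore.nodeSchema could look like, assuming a single string column that matches the table's id column (the actual schema is not shown in the question):

import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Hypothetical schema object; the real ParquetStore in the question is not shown.
object ParquetStore {
  // One string column, matching CREATE EXTERNAL TABLE nodes (id string).
  val nodeSchema: StructType = StructType(Seq(
    StructField("id", StringType, nullable = true)
  ))
}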

See my other question about getting Hive external tables to detect new files.

Solution

In order to get Hive to update its tables, I had to resort to using the partitioning feature of Hive. By creating a new partition during each Spark run, I create a series of directories internal to the /hdfs/nodes directory like this:

/hdfs/nodes/timestamp=<a-timestamp>/<parquet-files>
/hdfs/nodes/timestamp=<a-different-timestamp>/<parquet-files>
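
The answer does not show the table DDL, but for the repair command described below to work, the table needs to be declared with a matching partition column. A sketch of what that declaration might look like, issued through the same HiveContext (the backticks around timestamp just avoid a clash with the keyword; this is an assumption, not code from the answer):

// Hedged sketch: declare the external table as partitioned by `timestamp`,
// so the timestamp=<...> subdirectories map onto partitions.
hiveContext.sql("""
  CREATE EXTERNAL TABLE IF NOT EXISTS nodes (id string)
  PARTITIONED BY (`timestamp` string)
  STORED AS PARQUET
  LOCATION '/hdfs/nodes'
""")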

Then, after each Spark job completes, I run the Hive command MSCK REPAIR TABLE nodes using a HiveContext in my Spark job, which finds new partitions and updates the table.
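
Putting it together, the per-batch flow could look roughly like this; the timestamp value, the path construction, and the use of a single HiveContext for both the DataFrame and the repair command are assumptions on my part, not code from the answer:

import org.apache.spark.sql.{Row, SaveMode}

nodes.foreachRDD(rdd => {
  if (!rdd.isEmpty()) {
    // Assumed: write each micro-batch into its own timestamp=<...> partition directory.
    val partitionPath = s"/hdfs/nodes/timestamp=${System.currentTimeMillis()}"
    hiveContext.createDataFrame(rdd.map(n => Row(n.stuff)), ParquetStore.nodeSchema)
      .write.mode(SaveMode.Append).parquet(partitionPath)
    // Register the new partition with the Hive metastore so queries can see it.
    hiveContext.sql("MSCK REPAIR TABLE nodes")
  }
})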

I realize this isn't automatic, but it at least works.
