External Table not getting updated from parquet files written by spark streaming


Problem description

I am using Spark Streaming to write the aggregated output as Parquet files to HDFS using SaveMode.Append. I have an external table created like:

CREATE TABLE if not exists rolluptable
USING org.apache.spark.sql.parquet
OPTIONS (
  path "hdfs:////"
);
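
For reference, the write side of such a job looks roughly like the sketch below. This is a minimal illustration only: the socket source, column names, batch interval, and output path are assumptions, since the actual job and path are not shown in the question.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{SQLContext, SaveMode}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object RollupWriter {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("rollup-writer"))
    val ssc = new StreamingContext(sc, Seconds(60))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Hypothetical source: lines of "key,value" arriving on a socket.
    val events = ssc.socketTextStream("localhost", 9999)
      .map(_.split(","))
      .map(parts => (parts(0), parts(1).toLong))

    // Aggregate each micro-batch and append the result as Parquet files under
    // the directory that the external table points at (path is illustrative).
    events.foreachRDD { rdd =>
      val aggregated = rdd.reduceByKey(_ + _).toDF("key", "total")
      aggregated.write
        .mode(SaveMode.Append)
        .parquet("hdfs:///path/to/rolluptable")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}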

I was under the impression that, for an external table, queries should also pick up the data from newly added Parquet files. However, the newly written files do not seem to be picked up.

Dropping and recreating the table every time works fine, but that is not a real solution.
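
For clarity, that workaround amounts to something like the following sketch (sqlContext here is assumed to be an existing HiveContext, and the HDFS path is hypothetical, since the original path is truncated):

// Workaround (works, but not desired): drop and recreate the external table so
// that Spark re-discovers the Parquet files currently under the path.
sqlContext.sql("DROP TABLE IF EXISTS rolluptable")
sqlContext.sql(
  """CREATE TABLE IF NOT EXISTS rolluptable
    |USING org.apache.spark.sql.parquet
    |OPTIONS (path "hdfs:///path/to/rolluptable")""".stripMargin)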

Please suggest how my table can also include the data from the newer files.

Recommended answer

Are you reading those tables with Spark? If so, Spark caches Parquet table metadata (since schema discovery can be expensive), which is why newly appended files are not picked up automatically.

To overcome this, you have two options:

  1. Set the configuration spark.sql.parquet.cacheMetadata to false
  2. Refresh the table before querying: sqlContext.refreshTable("my_table") (see the sketch below)
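
A minimal sketch of both options, assuming a Spark 1.x HiveContext and the table name from the question (the application name is illustrative):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setAppName("rollup-query"))
val sqlContext = new HiveContext(sc)

// Option 1: disable Parquet metadata caching so each query re-lists the files.
sqlContext.setConf("spark.sql.parquet.cacheMetadata", "false")

// Option 2: keep caching enabled, but refresh the table's cached metadata
// (including the file listing) right before querying it.
sqlContext.refreshTable("rolluptable")
sqlContext.sql("SELECT * FROM rolluptable").show()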

See here for more details: http://spark.apache.org/docs/latest/sql-programming-guide.html#hive-metastore-parquet-table-conversion
