Impala: How to query against multiple parquet files with different schemata


Problem description

In Spark 2.1 I often use something like

df = spark.read.parquet("/path/to/my/files/*.parquet")

to load a folder of parquet files, even ones with different schemata. Then I run some SQL queries against the DataFrame using SparkSQL.

Now I want to try Impala because I read the wiki article, which contains sentences like:

Apache Impala is an open source massively parallel processing (MPP) SQL query engine for data stored in a computer cluster running Apache Hadoop [...].

Reads Hadoop file formats, including text, LZO, SequenceFile, Avro, RCFile, and Parquet.

So it sounds like it could also fit my use case (and might perform a bit faster).

But when I try things like:

CREATE EXTERNAL TABLE ingest_parquet_files LIKE PARQUET 
'/path/to/my/files/*.parquet'
STORED AS PARQUET
LOCATION '/tmp';

I get an AnalysisException:

AnalysisException: Cannot infer schema, path is not a file

So now my questions: Is it even possible to read a folder containing multiple parquet files with Impala? Will Impala perform a schema merge like Spark? What query do I need to perform this action? I couldn't find any information about it using Google (always a bad sign...).

Thanks!

Recommended answer

From what I understand, you have some parquet files and you want to query them through Impala tables? Below is my explanation.

You can create an external table and set its location to the directory containing the parquet files, like below:

CREATE EXTERNAL TABLE ingest_parquet_files (col1 string, col2 string) STORED AS PARQUET LOCATION "/path/to/my/files/";
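
Since your files have different schemata, the explicit column list has to cover all of them. Here is a minimal sketch of one way to derive that list, assuming you already know each file's columns; the helper names `union_columns` and `create_external_table_sql` and the `col3 bigint` column are hypothetical, not part of Impala. How Impala matches the columns against each file (by position or by name) depends on your version and query options, so check that before relying on it.

```python
def union_columns(file_schemas):
    """Merge per-file column lists into one ordered list of (name, type).

    Columns keep the order in which they are first seen. A column that
    shows up with two different types raises an error instead of guessing.
    """
    merged = {}
    for schema in file_schemas:
        for name, col_type in schema:
            seen = merged.setdefault(name, col_type)
            if seen != col_type:
                raise ValueError(
                    f"column {name!r} has conflicting types: {seen} vs {col_type}"
                )
    return list(merged.items())


def create_external_table_sql(table, columns, location):
    """Build a CREATE EXTERNAL TABLE statement with an explicit column list."""
    col_list = ", ".join(f"{name} {col_type}" for name, col_type in columns)
    return (
        f"CREATE EXTERNAL TABLE {table} ({col_list}) "
        f"STORED AS PARQUET LOCATION '{location}';"
    )


# Hypothetical schemas of two files that live in the same directory.
schemas = [
    [("col1", "string"), ("col2", "string")],
    [("col1", "string"), ("col3", "bigint")],
]
columns = union_columns(schemas)
statement = create_external_table_sql(
    "ingest_parquet_files", columns, "/path/to/my/files/"
)
print(statement)
```

The sketch only generates the statement text; you would still run it yourself, for example in impala-shell.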

Alternatively, you can load the parquet files after creating the table:

LOAD DATA INPATH "Your/HDFS/PATH" INTO TABLE schema.ingest_parquet_files;

What you are trying will also work, but you have to remove the wildcard character: LIKE PARQUET expects the path of a single parquet file to infer the schema from, and the table's data files are then looked up in the LOCATION directory.

CREATE EXTERNAL TABLE ingest_parquet_files LIKE PARQUET 
'/path/to/my/files/<one_of_your_files>.parquet'
STORED AS PARQUET
LOCATION '/tmp';
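
The Cloudera template names the LIKE PARQUET argument `hdfs_path_of_parquet_file`, so the schema is inferred from one file. Below is a minimal sketch of building the statement from a file listing, assuming you already have the paths as strings. The helper `like_parquet_sql` and the `part-000N.parquet` file names are hypothetical. Note that the table's schema then comes from that one sample file only.

```python
def like_parquet_sql(table, file_paths, location):
    """Build CREATE EXTERNAL TABLE ... LIKE PARQUET from one sample file.

    The first .parquet path (in sorted order) is used as the sample file
    for schema inference; `location` is the directory with the data files.
    """
    parquet_files = sorted(p for p in file_paths if p.endswith(".parquet"))
    if not parquet_files:
        raise ValueError("no .parquet file to infer the schema from")
    sample = parquet_files[0]
    return (
        f"CREATE EXTERNAL TABLE {table} LIKE PARQUET '{sample}'\n"
        f"STORED AS PARQUET\n"
        f"LOCATION '{location}';"
    )


# Hypothetical directory listing; only the .parquet entries are considered.
listing = [
    "/path/to/my/files/_SUCCESS",
    "/path/to/my/files/part-0002.parquet",
    "/path/to/my/files/part-0001.parquet",
]
ddl = like_parquet_sql("ingest_parquet_files", listing, "/path/to/my/files/")
print(ddl)
```

Picking the sample file explicitly avoids both the wildcard and the bare directory path.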

Below is the template, taken from the Cloudera Impala documentation, that you can refer to.

CREATE [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
  LIKE PARQUET 'hdfs_path_of_parquet_file'
  [COMMENT 'table_comment']
  [PARTITIONED BY (col_name data_type [COMMENT 'col_comment'], ...)]
  [WITH SERDEPROPERTIES ('key1'='value1', 'key2'='value2', ...)]
  [
   [ROW FORMAT row_format] [STORED AS file_format]
  ]
  [LOCATION 'hdfs_path']
  [TBLPROPERTIES ('key1'='value1', 'key2'='value2', ...)]
  [CACHED IN 'pool_name' [WITH REPLICATION = integer] | UNCACHED]
data_type:
    primitive_type
  | array_type
  | map_type
  | struct_type

Please note that the user you are using should have read-write access to any path you give to Impala. You can achieve this with the steps below:

-- Log in as the Hive superuser to perform the steps below
create role <role_name_x>;

-- Grant access to a database
grant all on database <db_name> to role <role_name_x>;

-- Grant access to an HDFS path
grant all on URI '/hdfs/path' to role <role_name_x>;

-- Grant the role to the group of the user who will run the Impala job
grant role <role_name_x> to group <your_user_name>;

-- After performing the steps above, you can validate with the commands below.
-- This should list the URI or database privileges held by the role:
show grant role <role_name_x>;

-- This shows which roles have been granted to the group:
show role grant group <your_user_name>;

More on how the roles and permissions work is here.
