Impala: How to query against multiple parquet files with different schemata


Problem description


In Spark 2.1, I often use something like

df = spark.read.parquet("/path/to/my/files/*.parquet")

to load a folder of parquet files even with different schemata. Then I perform some SQL queries against the dataframe using SparkSQL.
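For context, what Spark does here (with the `mergeSchema` option) is union the columns of all files into one schema. A rough sketch of that merge logic in plain Python, with per-file schemata modeled as hypothetical dicts rather than Spark API objects:

```python
# Sketch of the schema merge Spark performs with mergeSchema=True:
# the result is the union of all per-file columns, with the type taken
# from the first file that defines each column.

def merge_schemata(schemata):
    """Union a list of {column: type} dicts; first definition wins."""
    merged = {}
    for schema in schemata:
        for col, dtype in schema.items():
            merged.setdefault(col, dtype)
    return merged

# Two parquet files with overlapping but different columns (hypothetical):
file_a = {"id": "bigint", "name": "string"}
file_b = {"id": "bigint", "score": "double"}

print(merge_schemata([file_a, file_b]))
# → {'id': 'bigint', 'name': 'string', 'score': 'double'}
```

This is the behavior the question is asking Impala to reproduce; Impala itself reads one table schema rather than merging per-file schemata.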

Now I want to try Impala, because I read the wiki article, which contains sentences like:

Apache Impala is an open source massively parallel processing (MPP) SQL query engine for data stored in a computer cluster running Apache Hadoop [...].

Reads Hadoop file formats, including text, LZO, SequenceFile, Avro, RCFile, and Parquet.

So it sounds like it could also fit my use case (and maybe perform a bit faster).

But when I try things like:

CREATE EXTERNAL TABLE ingest_parquet_files LIKE PARQUET 
'/path/to/my/files/*.parquet'
STORED AS PARQUET
LOCATION '/tmp';

I get an AnalysisException

AnalysisException: Cannot infer schema, path is not a file

So now my questions: Is it even possible to read a folder containing multiple parquet files with Impala? Will Impala perform a schema merge like Spark? What query do I need to perform this action? I couldn't find any information about it using Google. (Always a bad sign...)

Thanks!

Solution

From what I understand, you have some parquet files and you want to query them through Impala tables. Below is my explanation.

You can create an external table and set the location to the parquet file directory, like below (note that `STORED AS` comes before `LOCATION`, and string literals use single quotes):

CREATE EXTERNAL TABLE ingest_parquet_files (col1 STRING, col2 STRING) STORED AS PARQUET LOCATION '/path/to/my/files/';

Another option is to load the parquet files after creating the table:

LOAD DATA INPATH 'Your/HDFS/PATH' INTO TABLE schema.ingest_parquet_files;

What you are trying will also work; you just have to remove the wildcard character, because LIKE PARQUET expects a plain path after it and looks for the files in that location.

CREATE EXTERNAL TABLE ingest_parquet_files LIKE PARQUET 
'/path/to/my/files/'
STORED AS PARQUET
LOCATION '/tmp';
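When many such tables have to be created, the DDL above can be generated programmatically rather than hand-written. A minimal sketch in plain Python (the helper name and quoting are my own, not an Impala API):

```python
def external_parquet_ddl(table, columns, location):
    """Render an Impala CREATE EXTERNAL TABLE statement for a parquet
    directory. `columns` is a list of (name, type) pairs."""
    cols = ", ".join(f"{name} {dtype}" for name, dtype in columns)
    return (
        f"CREATE EXTERNAL TABLE {table} ({cols}) "
        f"STORED AS PARQUET LOCATION '{location}'"
    )

ddl = external_parquet_ddl(
    "ingest_parquet_files",
    [("col1", "STRING"), ("col2", "STRING")],
    "/path/to/my/files/",
)
print(ddl)
```

The rendered statement can then be submitted through whatever Impala client you use (impala-shell, JDBC, etc.).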

Below is the template you can refer to, pulled from the Cloudera Impala documentation:

CREATE [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
  LIKE PARQUET 'hdfs_path_of_parquet_file'
  [COMMENT 'table_comment']
  [PARTITIONED BY (col_name data_type [COMMENT 'col_comment'], ...)]
  [WITH SERDEPROPERTIES ('key1'='value1', 'key2'='value2', ...)]
  [
   [ROW FORMAT row_format] [STORED AS file_format]
  ]
  [LOCATION 'hdfs_path']
  [TBLPROPERTIES ('key1'='value1', 'key2'='value2', ...)]
  [CACHED IN 'pool_name' [WITH REPLICATION = integer] | UNCACHED]
data_type:
    primitive_type
  | array_type
  | map_type
  | struct_type

Please note that the user you are using should have read-write access to any path you give to Impala. You can achieve this with the steps below:

#Login as the hive superuser to perform the below steps
create role <role_name_x>;

#For granting access to a database
grant all on database <database_name> to role <role_name_x>;

#For granting access to an HDFS path
grant all on URI '/hdfs/path' to role <role_name_x>;

#Granting the role to the group of the user you will use to run the impala job
grant role <role_name_x> to group <your_group_name>;

#After you perform the above steps you can validate with the below commands
#show grant role should list the URI or database access when you run it on the role name, as below

show grant role <role_name_x>;

#Now to validate that the user's group has access to the role

show role grant group <your_group_name>;
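The grant sequence above is the same for every role, so it can be templated. A small sketch that renders the statements for a given role, database, path, and group (all names below are placeholders, not values from the original post):

```python
def sentry_grant_statements(role, database, hdfs_path, group):
    """Render the CREATE ROLE / GRANT sequence shown above for one role."""
    return [
        f"create role {role};",
        f"grant all on database {database} to role {role};",
        f"grant all on URI '{hdfs_path}' to role {role};",
        f"grant role {role} to group {group};",
    ]

# Example with hypothetical names:
for stmt in sentry_grant_statements("etl_role", "ingest_db", "/hdfs/path", "etl_users"):
    print(stmt)
```

Each rendered statement would be executed as the hive superuser, exactly as in the manual sequence above.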

More on how roles and permissions work can be found in the Cloudera Sentry documentation.
