Presto on Amazon S3
I'm trying to use Presto on an Amazon S3 bucket, but haven't found much related information on the Internet.

I've installed Presto on a micro instance, but I can't figure out how to connect to S3. There is a bucket, and there are files in it. I have a running Hive metastore server and have configured it in Presto's hive.properties, but when I try to run the LOCATION command in Hive, it doesn't work.
It throws an error saying it cannot find the file scheme type s3.
Also, I don't know why we need to run Hadoop, but Hive doesn't run without it. Is there any explanation for this?
This and this are the documentation I followed while setting up.
Presto uses the Hive metastore to map database tables to their underlying files. These files can exist on S3, and can be stored in a number of formats: CSV, ORC, Parquet, Seq, etc.
The Hive metastore is usually populated through HQL (Hive Query Language) by issuing DDL statements like CREATE EXTERNAL TABLE ... with a LOCATION ... clause referencing the underlying files that hold the data.
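As a sketch, such a DDL statement might look like the following; the table name, columns, delimiter, and s3:// path are all illustrative placeholders, not values from the question:

```sql
-- Hypothetical external table whose data files live in an S3 bucket.
-- Hive/Presto read the files in place; dropping the table leaves them intact.
CREATE EXTERNAL TABLE orders (
  order_id BIGINT,
  customer STRING,
  total    DOUBLE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3://my-bucket/orders/';
```

Running this in Hive registers the table in the metastore, after which Presto can query it through the hive catalog.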
In order to get Presto to connect to a Hive metastore you will need to edit the hive.properties file (EMR puts this in /etc/presto/conf.dist/catalog/) and set the hive.metastore.uri parameter to the Thrift service of an appropriate Hive metastore service.
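A minimal hive.properties along these lines might look as follows; the host name and credentials are placeholders, and the key-based credentials are only one option (IAM instance roles are an alternative on EC2/EMR):

```
connector.name=hive-hadoop2
hive.metastore.uri=thrift://metastore-host:9083
# S3 credentials (placeholders); omit these if using IAM instance roles
hive.s3.aws-access-key=YOUR_ACCESS_KEY
hive.s3.aws-secret-key=YOUR_SECRET_KEY
```

After editing the catalog file, restart the Presto server so the new catalog configuration is picked up.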
The Amazon EMR cluster instances will automatically configure this for you if you select Hive and Presto, so it's a good place to start.
If you want to test this on a standalone EC2 instance, then I'd suggest that you first focus on getting a functional Hive service working with the Hadoop infrastructure. You should be able to define tables that reside locally on the HDFS file system. Presto complements Hive, but it does require a functioning Hive set-up. Presto's native DDL statements are not as feature-complete as Hive's, so you'll do most table creation from Hive directly.
Alternatively, you can define Presto connectors for a MySQL or PostgreSQL database, but it's just a JDBC pass-through, so I don't think you'll gain much.
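For reference, such a JDBC-backed catalog is just another properties file in the same catalog directory; a hypothetical mysql.properties (host, user, and password are placeholders) might look like:

```
connector.name=mysql
connection-url=jdbc:mysql://mysql-host:3306
connection-user=presto
connection-password=secret
```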