How to use external data with Elastic MapReduce


Question

From Amazon's EMR FAQ:

Q: Can I load my data from the internet or somewhere other than Amazon S3?

Yes. Your Hadoop application can load the data from anywhere on the internet or from other AWS services. Note that if you load data from the internet, EC2 bandwidth charges will apply. Amazon Elastic MapReduce also provides Hive-based access to data in DynamoDB.

What are the specifications for loading data from external (non-S3) sources? There seems to be a dearth of resources on this option, and it doesn't appear to be documented anywhere.

Recommended answer

If you want to do it "the Hadoop way", you should either implement a DFS over your data source, or put references to your source URLs into a file that serves as the input to the MR job.
At the same time, Hadoop is about moving code to data. Even EMR over S3 is not ideal from this perspective, because EC2 and S3 are different clusters. So it is hard to imagine efficient MR processing if the data source is physically outside the data center.
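
The second option (a file of source URLs as the job input) could look roughly like the Hadoop Streaming mapper below. This is only a minimal sketch, not something taken from the EMR documentation: it assumes each input line is a single URL pointing at UTF-8 text, and the names fetch_lines, map_urls and main are hypothetical helpers chosen for illustration.

```python
#!/usr/bin/env python3
"""Sketch of a Hadoop Streaming mapper: each input line is a URL to fetch."""
import sys
import urllib.request


def fetch_lines(url, timeout=30):
    """Download `url` and yield its lines (assumption: UTF-8 text content)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        for raw in resp:
            yield raw.decode("utf-8", errors="replace").rstrip("\r\n")


def map_urls(url_lines, fetch=fetch_lines):
    """For every non-empty URL line, yield (url, record) for each fetched line."""
    for line in url_lines:
        url = line.strip()
        if not url:
            continue  # skip blank lines in the URL list
        for record in fetch(url):
            yield url, record


def main(stdin=sys.stdin, stdout=sys.stdout, fetch=fetch_lines):
    """Read URLs from stdin and write tab-separated key/value pairs to stdout."""
    for url, record in map_urls(stdin, fetch=fetch):
        stdout.write(f"{url}\t{record}\n")


if __name__ == "__main__":
    main()
```

The URL list itself is small and can live in S3 or HDFS, while the actual data is pulled by the mappers at run time. With a plain text input the whole list may end up in a single mapper, so an input format such as NLineInputFormat is one way to spread the URLs across several mappers. The data-locality caveat from the answer still applies: every byte crosses the network before it is processed.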
