Use wget with Hadoop?


Question


I have a dataset (~31GB, a zipped file with the .gz extension) hosted at a web location, and I want to run my Hadoop program on it. The program is a slight modification of the original WordCount example that ships with Hadoop. In my case, Hadoop is installed on a remote machine (to which I connect via ssh and then run my jobs). The problem is that I can't transfer this large dataset to my home directory on the remote machine (due to a disk usage quota). So I tried searching for a way to use wget to fetch the dataset and pass it directly onto HDFS (without saving it on my local account on the remote machine), but had no luck. Does such a way even exist? Any other suggestions to get this working?


I've already tried using the Yahoo! VM, which comes pre-configured with Hadoop, but it's too slow and, on top of that, runs out of memory since the dataset is large.

Answer

Have a look at the answer here: putting a remote file into hadoop without copying it to local disk.


You can pipe the data from wget to hdfs.
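As a minimal sketch of that approach (the URL and HDFS path below are placeholders, not from the original question): hadoop fs -put reads from standard input when the source is "-", so the download never has to touch the local disk.

    # Stream the compressed dataset from the web straight into HDFS.
    # Placeholder URL and HDFS path -- substitute your own.
    wget -qO- http://example.com/dataset.gz | hadoop fs -put - /user/me/dataset.gz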


However, you will have a problem - gz is not splittable so you won't be able to run a distributed map/reduce on it.


I suggest you download the file locally, unzip it and then either pipe it in or split it into multiple files and load them into hdfs.
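A rough sketch of those two options (again with placeholder URL and paths, and streaming the download so the quota-limited local account is barely used):

    # Option 1: decompress on the fly and pipe the plain text into HDFS,
    # so nothing needs to be stored locally at all.
    wget -qO- http://example.com/dataset.gz | gunzip -c | hadoop fs -put - /user/me/dataset.txt

    # Option 2: with some local scratch space, split the unzipped text into
    # smaller files and load them all into an HDFS directory.
    gunzip -c dataset.gz | split -l 1000000 - part_
    hadoop fs -mkdir /user/me/dataset
    hadoop fs -put part_* /user/me/dataset/

Either way, the uncompressed text (unlike the .gz file) can be split across mappers, which is what makes the extra step worthwhile.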

