Is it possible to run Hadoop in Pseudo-Distributed operation without HDFS?


Question


I'm exploring the options for running a hadoop application on a local system.

As with many applications the first few releases should be able to run on a single node, as long as we can use all the available CPU cores (Yes, this is related to this question). The current limitation is that on our production systems we have Java 1.5 and as such we are bound to Hadoop 0.18.3 as the latest release (See this question). So unfortunately we can't use this new feature yet.

The first option is to simply run hadoop in pseudo distributed mode. Essentially: create a complete hadoop cluster with everything on it running on exactly 1 node.

The "downside" of this form is that it also uses a full fledged HDFS. This means that in order to process the input data this must first be "uploaded" onto the DFS ... which is locally stored. So this takes additional transfer time of both the input and output data and uses additional disk space. I would like to avoid both of these while we stay on a single node configuration.

So I was thinking: Is it possible to override the "fs.hdfs.impl" setting and change it from "org.apache.hadoop.dfs.DistributedFileSystem" into (for example) "org.apache.hadoop.fs.LocalFileSystem"?
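As a side note, here is a tiny sketch (the class name is invented, not from the original post) of how the choice of filesystem implementation is purely a configuration matter in Hadoop's FileSystem API: each URI scheme is mapped to an implementation class through the fs.&lt;scheme&gt;.impl properties, and setting fs.default.name to file:/// makes FileSystem.get() hand back the local implementation rather than DistributedFileSystem.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WhichFileSystem {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Make file:/// the default filesystem so that plain paths resolve
    // against the local disk instead of HDFS.
    conf.set("fs.default.name", "file:///");

    FileSystem fs = FileSystem.get(conf);
    System.out.println(fs.getClass().getName());     // a LocalFileSystem
    System.out.println(fs.exists(new Path("/tmp"))); // reads the local disk
  }
}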

If this works the "local" hadoop cluster (which can ONLY consist of ONE node) can use existing files without any additional storage requirements and it can start quicker because there is no need to upload the files. I would expect to still have a job and task tracker and perhaps also a namenode to control the whole thing.

Has anyone tried this before? Can it work or is this idea much too far off the intended use?

Or is there a better way of getting the same effect: Pseudo-Distributed operation without HDFS?

Thanks for your insights.


EDIT 2:

This is the config I created for hadoop 0.18.3 conf/hadoop-site.xml using the answer provided by bajafresh4life.

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>file:///</value>
  </property>

  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:33301</value>
  </property>

  <property>
    <name>mapred.job.tracker.http.address</name>
    <value>localhost:33302</value>
    <description>
    The job tracker http server address and port the server will listen on.
    If the port is 0 then the server will start on a free port.
    </description>
  </property>

  <property>
    <name>mapred.task.tracker.http.address</name>
    <value>localhost:33303</value>
    <description>
    The task tracker http server address and port.
    If the port is 0 then the server will start on a free port.
    </description>
  </property>

</configuration>

Solution

Yes, this is possible, although I'm using 0.19.2. I'm not too familiar with 0.18.3, but I'm pretty sure it shouldn't make a difference.

Just make sure that fs.default.name is set to the default (which is file:///), and mapred.job.tracker is set to point to where your jobtracker is hosted. Then start up your daemons using bin/start-mapred.sh. You don't need to start up the namenode or datanodes. At this point you should be able to run your map/reduce jobs using bin/hadoop jar ...
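For completeness, here is a minimal driver sketch for the old org.apache.hadoop.mapred API of the 0.18/0.19 line (the class name and argument handling are invented, and the two conf.set() calls merely mirror the conf/hadoop-site.xml above, so they can be dropped when that file is on the classpath). With the default filesystem set to file:///, both paths below are ordinary local directories and nothing is uploaded to a DFS:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class LocalFsJob {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(LocalFsJob.class);
    conf.setJobName("local-fs-example");

    // Same settings as in conf/hadoop-site.xml above.
    conf.set("fs.default.name", "file:///");
    conf.set("mapred.job.tracker", "localhost:33301");

    // Identity map/reduce just passes the input through; with the default
    // TextInputFormat the keys are LongWritable offsets and values are Text.
    conf.setOutputKeyClass(LongWritable.class);
    conf.setOutputValueClass(Text.class);
    conf.setMapperClass(IdentityMapper.class);
    conf.setReducerClass(IdentityReducer.class);

    // Plain local directories, read and written directly by the tasks.
    FileInputFormat.addInputPath(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
  }
}

Packaged into a jar, this would be submitted with bin/hadoop jar exactly as described above.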

We've used this configuration to run Hadoop over a small cluster of machines using a Netapp appliance mounted over NFS.
