Can TensorFlow read from HDFS on Mac?

Question

I'm trying to coerce Tensorflow on OS/X to read from HDFS. The documentation

https://www.tensorflow.org/deploy/hadoop

does not clearly specify whether this is possible, and the code refers only to "posix" operating systems. The error I'm seeing when trying to use the HDFS is the following:

UnimplementedError (see above for traceback): File system scheme hdfs not implemented [[Node: ReaderReadV2 = ReaderReadV2[_device="/job:localhost/replica:0/task:0/cpu:0"](TFRecordReaderV2, input_producer)]]

Here's what I've done up to this point:

  1. brew installed Hadoop 2.7.2
  2. separately compiled Hadoop 2.7.2 for the native libraries. Hadoop is installed on /usr/local/Cellar/hadoop/2.7.2/libexec on my system, and the native libraries (libhdfs.dylib) are in ~/Source/hadoop/hadoop-hdfs-project/hadoop-hdfs/target/hadoop-hdfs-2.7.2/lib/native.
  3. Edited the code at https://github.com/tensorflow/tensorflow/blob/v1.0.0/tensorflow/core/platform/hadoop/hadoop_file_system.cc#L113-L119 to read from libhdfs.dylib rather than libhdfs.so, recompiled, and reinstalled Tensorflow. (I have to admit this is pretty boneheaded, and I have no idea if it's all that's required to make this code work on Mac.)
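Before rebuilding TensorFlow, it may help to confirm which libhdfs file is actually present. The sketch below assumes the loader looks under $HADOOP_HDFS_HOME/lib/native, which matches the layout described in step 2; that lookup location is an assumption here, not something verified against the TensorFlow source. The helper name find_libhdfs is hypothetical.

```python
import os

# Candidate file names: libhdfs.so is the name the question says the
# TensorFlow 1.0.0 source refers to; libhdfs.dylib is what the macOS
# build of Hadoop produced in step 2.
CANDIDATE_NAMES = ("libhdfs.so", "libhdfs.dylib")


def find_libhdfs(hdfs_home):
    """Return the libhdfs files that exist under <hdfs_home>/lib/native.

    Assumption: the native library directory is lib/native, as in step 2.
    """
    native_dir = os.path.join(hdfs_home, "lib", "native")
    return [
        os.path.join(native_dir, name)
        for name in CANDIDATE_NAMES
        if os.path.isfile(os.path.join(native_dir, name))
    ]


if __name__ == "__main__":
    home = os.environ.get("HADOOP_HDFS_HOME")
    if home is None:
        print("HADOOP_HDFS_HOME is not set")
    else:
        found = find_libhdfs(home)
        print(found if found else "no libhdfs found under " + home)
```

If only libhdfs.dylib shows up, an unpatched build that asks for libhdfs.so would not find it at that path.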

Here's the code to reproduce.

test.sh:

set -x

export JAVA_HOME=$($(dirname $(which java | xargs readlink))/java_home)
export HADOOP_HOME=/usr/local/Cellar/hadoop/2.7.2/libexec

. $HADOOP_HOME/libexec/hadoop-config.sh

export HADOOP_HDFS_HOME=$(echo ~/Source/hadoop/hadoop-hdfs-project/hadoop-hdfs/target/hadoop-hdfs-2.7.2)

export CLASSPATH=$($HADOOP_HDFS_HOME/bin/hdfs classpath --glob)

# Virtual environment with Tensorflow and necessary dependencies
. venv/bin/activate

python ./test.py

test.py:

import tensorflow as tf

_, example_bytes = tf.TFRecordReader().read(
    tf.train.string_input_producer(
        [
            "hdfs://localhost:9000/user/foo/feature_output/part-r-00000",
            "hdfs://localhost:9000/user/foo/feature_output/part-r-00001",
            "hdfs://localhost:9000/user/foo/feature_output/part-r-00002",
            "hdfs://localhost:9000/user/foo/feature_output/part-r-00003",
        ]
    )
)

with tf.Session().as_default() as sess:
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(coord=coord)
    print(len(sess.run(example_bytes)))

The code path I'm seeing in the Tensorflow source seems to indicate to me that I'd receive a different error than the one above if the issue were really mac-specific, since some kind of handler is registered for the "hdfs" scheme regardless: https://github.com/tensorflow/tensorflow/blob/v1.0.0/tensorflow/core/platform/hadoop/hadoop_file_system.cc#L474 . Has anyone else succeeded in coercing Tensorflow to work with Mac? If it isn't supported, is there an easy place to patch it?

I'm also open to suggestions as to what might be a better approach. The high-level goal is to efficiently train a model in parallel, using shared parameter servers, considering that each worker will only read a subset of the data. This is readily accomplished using the local filesystem, but it's less clear how to scale beyond that. Even if I do succeed in making the code above work, the result could suffer from problems with data locality.
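One way to give each worker only a subset of the data is to split the list of input files by worker index before it reaches the input pipeline. This is a minimal sketch in plain Python, not a TensorFlow API; shard_files, num_workers and worker_index are hypothetical names, and it does nothing about data locality.

```python
def shard_files(filenames, num_workers, worker_index):
    """Return the files assigned to one worker, round-robin by position."""
    if num_workers <= 0:
        raise ValueError("num_workers must be positive")
    if not 0 <= worker_index < num_workers:
        raise ValueError("worker_index must be in [0, num_workers)")
    # Worker i takes files i, i + num_workers, i + 2 * num_workers, ...
    return filenames[worker_index::num_workers]


if __name__ == "__main__":
    files = [
        "hdfs://localhost:9000/user/foo/feature_output/part-r-%05d" % i
        for i in range(4)
    ]
    # Worker 0 of 2 gets part-r-00000 and part-r-00002.
    print(shard_files(files, num_workers=2, worker_index=0))
```

Each worker would then pass its own shard to tf.train.string_input_producer in place of the full list.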

This thread https://github.com/tensorflow/tensorflow/issues/2218 suggests using pyspark.RDD.toLocalIterator to iterate over the data set with a placeholder in the graph. Aside from my concern about forcing each worker to iterate through the full dataset, I don't see a way to coerce Tensorflow's builtin Estimator class to accept a custom feed function along with a specified input_fn, and a custom input_fn appears necessary in order to take advantage of models like LinearClassifier (https://www.tensorflow.org/tutorials/linear) that are capable of learning from sparse, weighted features.

Any ideas?

Recommended answer

Did you enable HDFS support in ./configure when building? That's the error you would get if HDFS is disabled.

I think you made the correct change to make it work. Feel free to send a pull request to look for .dylib on macOS.
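The idea behind such a pull request is to pick the library file name by platform. The actual change would be in the C++ file hadoop_file_system.cc; the Python below only illustrates the selection logic, and libhdfs_filename is a hypothetical name, not part of TensorFlow.

```python
import platform


def libhdfs_filename(system=None):
    """Return the libhdfs file name expected on the given platform.

    `system` takes the values returned by platform.system(); "Darwin"
    is what macOS reports. Any other value falls back to the .so name.
    """
    if system is None:
        system = platform.system()
    return "libhdfs.dylib" if system == "Darwin" else "libhdfs.so"


if __name__ == "__main__":
    print(libhdfs_filename())
```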
