TensorFlow Execution on a single (multi-core) CPU Device


Question

I have some questions regarding the execution model of TensorFlow in the specific case in which there is only a CPU device and the network is used only for inference, for instance using the Image Recognition (https://www.tensorflow.org/tutorials/image_recognition) C++ example on a multi-core platform.

In the following, I will try to summarize my understanding while asking some questions.

Session->Run() (file direct_session.cc) calls ExecutorState::RunAsync, which initializes the TensorFlow ready queue with the root nodes.

Then, the instruction

runner_([=]() { Process(tagged_node, scheduled_usec); }); (executor.cc, function ScheduleReady, line 2088)

assigns the node (and hence the related operation) to a thread of the inter_op pool. However, I do not fully understand how this works. For instance, if ScheduleReady tries to assign more operations than the size of the inter_op pool, how are the operations enqueued (in FIFO order?)? Does each thread of the pool have its own operation queue, or is there a single shared queue? Where can I find this in the code? Where can I find the body of each thread of the pools?

Another question regards the nodes managed by inline_ready. How does the execution of these (inexpensive or dead) nodes differ from that of the other nodes?

Then (still, to my understanding) the execution flow continues from ExecutorState::Process, which executes the operation, distinguishing between synchronous and asynchronous operations. How do synchronous and asynchronous operations differ in terms of execution?

When the operation has executed, PropagateOutputs (which calls ActivateNodes) adds to the ready queue every successor node that becomes ready thanks to the execution of the current node (its predecessor).

Finally, NodeDone() calls ScheduleReady(), which processes the nodes currently in the TensorFlow ready queue.

Conversely, how the intra_op thread pool is managed depends on the specific kernel, right? Is it possible for a kernel to request more operations than the intra_op thread pool size? If so, in what order are they enqueued (FIFO?)?

Once operations are assigned to threads of the pool, is their scheduling left to the underlying operating system, or does TensorFlow enforce some kind of scheduling policy?

I'm asking here because I found almost nothing about this part of the execution model in the documentation; if I missed some documents, please point me to them.

Answer

Re ThreadPool: When TensorFlow uses DirectSession (as it does in your case), it uses Eigen's ThreadPool. I could not get a web link to the official version of Eigen used in TensorFlow, but the thread pool code is Eigen's NonBlockingThreadPool, which is built on the RunQueue queue implementation. There is one queue per thread.
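
To make the dispatch mechanics concrete, here is a minimal standalone sketch, not TensorFlow code, assuming a recent Eigen where Eigen::ThreadPool is the non-blocking pool. It hands closures to the pool the same way runner_ does. Each worker owns a RunQueue; Schedule() pushes onto the calling worker's own queue when invoked from a pool thread, otherwise onto a randomly chosen worker's queue, and idle workers steal from the other queues, so there is no single global FIFO.

#include <unsupported/Eigen/CXX11/ThreadPool>
#include <atomic>
#include <cstdio>

int main() {
  Eigen::ThreadPool pool(4);  // plays the role of the inter_op pool
  std::atomic<int> done{0};
  for (int i = 0; i < 8; ++i) {
    pool.Schedule([i, &done] {  // analogous to runner_(...) in executor.cc
      std::printf("task %d\n", i);
      done.fetch_add(1);
    });
  }
  while (done.load() < 8) {}  // crude wait, for illustration only
  return 0;
}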

Re inline_ready: ExecutorState::Process is scheduled in some Eigen thread. When it runs, it executes some nodes. As these nodes finish, they make other nodes (TensorFlow operations) ready. Some of these nodes are not expensive; they are added to inline_ready and executed in the same thread, without yielding. Other nodes are expensive and are not executed "immediately" in the same thread; their execution is scheduled through the Eigen thread pool.
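
As a rough illustration, here is a self-contained toy model of that cheap-inline / expensive-dispatch split. It is simplified from the ScheduleReady logic in executor.cc: the Node type and Dispatch function are invented for the example, and the real code additionally keeps one expensive node to run locally rather than dispatching all of them.

#include <cstdio>
#include <deque>
#include <vector>

struct Node { int id; bool expensive; };

// Stand-in for runner_: in TensorFlow this hands the closure to the Eigen pool.
void Dispatch(const Node& n) { std::printf("node %d -> inter_op pool\n", n.id); }

void ScheduleReady(const std::vector<Node>& ready, std::deque<Node>* inline_ready) {
  for (const Node& n : ready) {
    if (!n.expensive) {
      inline_ready->push_back(n);  // cheap (or dead): run on this thread, no yield
    } else {
      Dispatch(n);                 // expensive: another thread picks it up
    }
  }
}

int main() {
  std::deque<Node> inline_ready;
  ScheduleReady({{0, false}, {1, true}, {2, false}}, &inline_ready);
  while (!inline_ready.empty()) {  // the current thread drains inline_ready itself
    std::printf("node %d -> executed inline\n", inline_ready.front().id);
    inline_ready.pop_front();
  }
  return 0;
}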

Re sync/async kernels: TensorFlow operations can be backed by synchronous kernels (most CPU kernels) or asynchronous kernels (most GPU kernels). Synchronous kernels are executed in the thread running Process. Asynchronous kernels are dispatched to their device (usually a GPU) to be executed. When asynchronous kernels are done, they invoke the NodeDone method.
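
The two flavors correspond to two base classes in the C++ kernel API. A minimal skeleton follows; the op names are hypothetical, and registration and actual work are omitted:

#include "tensorflow/core/framework/op_kernel.h"

class SyncExampleOp : public tensorflow::OpKernel {
 public:
  using tensorflow::OpKernel::OpKernel;
  void Compute(tensorflow::OpKernelContext* ctx) override {
    // Runs to completion on the executor thread that called Process().
  }
};

class AsyncExampleOp : public tensorflow::AsyncOpKernel {
 public:
  using tensorflow::AsyncOpKernel::AsyncOpKernel;
  void ComputeAsync(tensorflow::OpKernelContext* ctx, DoneCallback done) override {
    // Kicks off device work and returns immediately; invoking `done`
    // later is what eventually leads to NodeDone() in the executor.
    done();
  }
};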

Re Intra Op ThreadPool: The intra-op thread pool is made available to kernels so they can run their computation in parallel. Most CPU kernels don't use it (and GPU kernels just dispatch to the GPU) and run synchronously in the thread that called the Compute method. Depending on the configuration, there is either one intra-op thread pool shared by all (CPU) devices, or each device has its own. Kernels simply schedule their work on this thread pool. If there are more tasks than threads, they are scheduled and executed in an unspecified order. The ThreadPool interface exposed to kernels is tensorflow::thread::ThreadPool (tensorflow/core/lib/core/threadpool.h).
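
For instance, here is a sketch of how a CPU kernel can split a loop over the intra-op pool from inside its Compute method. It is based on the Shard utility in tensorflow/core/util/work_sharder.h from the 1.x source tree; ShardedLoop is a hypothetical helper and cost_per_unit is a made-up estimate.

#include "tensorflow/core/framework/op_kernel.h"
#include "tensorflow/core/util/work_sharder.h"

// Called from a hypothetical kernel's Compute(); shards [0, total) over
// the intra-op worker threads attached to this op's device.
void ShardedLoop(tensorflow::OpKernelContext* ctx, tensorflow::int64 total) {
  auto* workers = ctx->device()->tensorflow_cpu_worker_threads()->workers;
  const tensorflow::int64 cost_per_unit = 5000;  // rough cycles per element
  tensorflow::Shard(workers->NumThreads(), workers, total, cost_per_unit,
                    [](tensorflow::int64 begin, tensorflow::int64 end) {
                      for (tensorflow::int64 i = begin; i < end; ++i) {
                        // ... process element i ...
                      }
                    });
}

Shard uses cost_per_unit to decide how many pieces are worth creating, so very cheap loops stay single-threaded.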

I don't know of any way TensorFlow influences the scheduling of OS threads. You can ask it to do some spinning (i.e., not immediately yield the thread to the OS) to minimize latency (from OS scheduling), but that is about it.

These internal details are not documented on purpose, as they are subject to change. If you are using TensorFlow through the Python API, all you need to know is that your ops will execute when their inputs are ready. If you want to enforce some ordering beyond this, you should use:

with tf.control_dependencies(<tensors_that_you_want_computed_before_the_ops_inside_this_block>):
  tf.foo_bar(...) 

If you are writing a custom CPU kernel and want to do parallelism inside it (usually needed only rarely, for very expensive kernels), the thread pool interface mentioned above is what you can rely on.

