Multithreading - How to use CPU as much as possible?

Question

I'm currently implementing a TensorFlow custom op (a custom data fetcher) in C++ to speed up my TensorFlow model. Since the model doesn't use the GPU much, I believe I can get maximal performance by running multiple worker threads concurrently.

The problem is that even though I have enough workers, my program doesn't use all of the CPU. On my development machine (4 physical cores), it uses about 90% user time and 4% sys time with 4 worker threads and the tf.ConfigProto(inter_op_parallelism_threads=6) option.
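One rough way to check figures like these for a single process, rather than system-wide, is to compare the process's own CPU time against wall-clock time. The sketch below is only an illustration: `cpu_utilization` and its `work` argument are hypothetical names, and `work` stands in for whatever drives the model (for example, a loop over sess.run()). It relies on `os.times()`, which reports CPU time summed over all threads of the process, including native C++ threads.

```python
import os
import time


def cpu_utilization(work, cores=None):
    """Run work() and return (user, system) CPU time as a fraction of
    the total capacity available: wall-clock time multiplied by cores.

    `work` is a hypothetical zero-argument callable; `cores` defaults to
    os.cpu_count(), which counts logical CPUs, not physical cores.
    """
    cores = cores or os.cpu_count() or 1
    cpu_before = os.times()
    wall_before = time.perf_counter()
    work()
    wall_after = time.perf_counter()
    cpu_after = os.times()

    # Guard against a zero-length interval for trivially short work.
    capacity = max(wall_after - wall_before, 1e-9) * cores
    user = (cpu_after.user - cpu_before.user) / capacity
    system = (cpu_after.system - cpu_before.system) / capacity
    return user, system


if __name__ == "__main__":
    user, system = cpu_utilization(lambda: sum(i * i for i in range(2_000_000)))
    print(f"user: {user:.1%}  sys: {system:.1%}")
```

A user fraction well below 1.0 with a low sys fraction suggests that threads are idle or waiting rather than spending time in the kernel, but it does not say where they wait; that still takes a profiler.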

With more worker threads and a larger inter_op_parallelism_threads setting, the model runs much slower than with the previous configuration. Since I'm not good at profiling, I don't know where the bottleneck in my code is.

Are there any rules of thumb for maximizing CPU usage, and/or good tools for finding performance bottlenecks or mutex contention in a single process (not system-wide) on Linux?

My code is driven from Python, but (almost) all of the execution happens in C++ code. Some of that code is not mine (TensorFlow and Eigen). I've built a shared library that is dynamically loaded in Python and called by TensorFlow kernels. TensorFlow owns its thread pool, my library owns its own thread pool, and my code is thread-safe. I also create threads that call sess.run() concurrently. Just as Python can issue multiple HTTP requests concurrently, sess.run() releases the GIL. My objective is to call sess.run() as often as possible to increase "real" performance, and none of the Python-related profilers I tried was successful.

Recommended answer

1) More threads does not mean more speed. If you have 4 cores, you cannot go any faster than 4 times the speed of 1 core.

2) What you should do is tune your code for maximum performance in single-threaded execution (with compiler optimization turned off). After you have done that, turn on the compiler's optimizer and make the code multi-threaded, with no more threads than you have cores.
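The "no more threads than cores" rule from point 2 could be sketched on the Python side roughly as follows. This is a minimal illustration, not the asker's actual code: `capped_worker_count`, `run_batches`, and `step` are hypothetical names, with `step` standing in for a call such as sess.run(). Note that `os.cpu_count()` returns logical CPUs, so on a machine with hyper-threading it may be larger than the number of physical cores; pass `cores` explicitly if you want to cap at the physical count.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def capped_worker_count(requested, cores=None):
    """Return a worker count that is at least 1 and never exceeds `cores`.

    `cores` defaults to os.cpu_count() (logical CPUs, an assumption here).
    """
    cores = cores or os.cpu_count() or 1
    return max(1, min(requested, cores))


def run_batches(step, batches, requested_workers):
    """Call step(batch) for every batch with at most one thread per core.

    `step` is a hypothetical callable that releases the GIL while it runs,
    as sess.run() does; results come back in input order.
    """
    workers = capped_worker_count(requested_workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(step, batches))


if __name__ == "__main__":
    print(capped_worker_count(16, cores=4))
    print(run_batches(lambda b: b * 2, [1, 2, 3], requested_workers=8))
```

If the native code behind `step` also runs its own thread pool, as in the question, the total number of busy threads is the product of both levels, so the cap may need to be applied to that total rather than to the Python threads alone.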

P.S. It is a common misconception that performance tuning can only be done on compiler-optimized code. This explains why it's not so.
