Fastest possible way to pass data from one thread to another
Question
I'm using Boost's spsc_queue to move data from one thread to another. It is one of the critical paths in my software, so I want it to be as fast as possible. I wrote this test program:

    #include <boost/lockfree/spsc_queue.hpp>
    #include <stdint.h>

    #include <condition_variable>
    #include <thread>

    const int N_TESTS = 1000;
    int results[N_TESTS];

    boost::lockfree::spsc_queue<int64_t, boost::lockfree::capacity<1024>> testQueue;

    using std::chrono::nanoseconds;
    using std::chrono::duration_cast;

    int totalQueueNano(0);
    int totalQueueCount(0);

    void Consumer() {
        int i = 0;
        int64_t scheduledAt;
        while (i < N_TESTS - 1) {
            while (testQueue.pop(scheduledAt)) {
                int64_t dequeuedAt = (duration_cast<nanoseconds>(
                        std::chrono::high_resolution_clock::now().time_since_epoch())).count();
                auto diff = dequeuedAt - scheduledAt;
                totalQueueNano += diff;
                ++totalQueueCount;
                results[i] = diff;
                ++i;
            }
        }
        for (int i = 0; i < N_TESTS; i++) {
            printf("%d ", results[i]);
        }
        printf("\nspsc_queue latency average nano = %d\n", totalQueueNano / totalQueueCount);
    }

    int main() {
        std::thread t(Consumer);
        usleep(1000000);
        for (int i = 0; i < N_TESTS; i++) {
            usleep(1000);
            int64_t scheduledAt = (duration_cast<nanoseconds>(
                    std::chrono::high_resolution_clock::now().time_since_epoch())).count();
            testQueue.push(scheduledAt);
        }
        usleep(1000000);
        return 0;
    }
Compile flags:

    g++ -std=c++0x -O3 -Wall -c -fmessage-length=0 -march=native -mtune=native -pthread -MMD -MP -MF"src/TestProject.d" -MT"src/TestProject.d" -o "src/TestProject.o" "../src/TestProject.cpp"
    g++ -pthread -o "TestProject" ./src/TestProject.o -lpthread
On my machine (RHEL 7.1, gcc 4.8.3, Xeon E5-2690 v3) I measure 290-300 nanoseconds.
- How good is my test application? Am I measuring "spsc_queue" latency correctly?
- What is the current industry best time to pass data from one thread to another?
- Is Boost's spsc_queue a good choice for moving data from one thread to another?
- Can you recommend something faster than spsc_queue?
- Can you write code that does the same work significantly faster?
upd: A queue mechanism is required. If the first thread produces data every 1000 nanoseconds but the second thread spends 10 000 nanoseconds processing a single item, I need to "queue" several items for a short period of time. But my "queue" is never "too big". A short fixed-size ring buffer should be enough.
upd2: So in short the question is: what is the fastest single-producer single-consumer queue (most likely based on a fixed-size ring buffer)? I'm using Boost's spsc_queue and I achieve ~300 ns latency. Can you suggest something faster?
upd3: In the Java world there is Disruptor, which achieves 50 ns latency: https://code.google.com/p/disruptor/wiki/PerformanceResults Do we have something in C++ with the same 50 ns latency?
Answer
Since you have ints, what you (ideally) measure above is the overall latency between a call to push() and the time pop() returns true.
This doesn't make sense: the consumer thread is busily polling the queue, that is, it loops and busily checks whether pop has fetched a value.
- This is wasteful, and
- if you want to minimize latency, polling is certainly not the way to go
If (IFF) you want to minimize latency (for a single item), my guess would be to use a signaling synchronization mechanism. spsc_queue, as far as I can tell, does not provide one. (You'd need a container or custom solution in which you employ some kind of condition variable (http://www.boost.org/doc/libs/1_57_0/doc/html/thread/synchronization.html#thread.synchronization.condvar_ref.condition_variable), event, ...)
If (IFF), however, you want to maximise throughput (items per unit of time), then measuring the latency of a "wakeup" for a (single) item makes even less sense. In that case you want to make the best use of the parallelism you have, as is mentioned in a comment (http://stackoverflow.com/questions/29507669/fastest-possible-way-to-pass-data-from-one-thread-to-another?noredirect=1#comment47172171_29507669):
"Often the fastest way to pass data is to use a single thread for each chunk of data. That is to say, use only the parallelism present in the data."
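To make the "signaling synchronization mechanism" idea concrete, here is a minimal sketch using only the C++ standard library. NotifyingQueue and run_notifying_demo are hypothetical names made up for this illustration; they are not part of Boost or of the original answer. The consumer sleeps on a condition variable until the producer signals it, instead of spinning on pop().

```cpp
#include <condition_variable>
#include <cstdint>
#include <deque>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical helper (not a Boost class): a queue whose consumer blocks
// on a condition variable instead of polling.
template <typename T>
class NotifyingQueue {
public:
    void push(T value) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            items_.push_back(std::move(value));
        }
        ready_.notify_one();  // signal after releasing the lock
    }

    // Blocks until an item is available, then returns it.
    T pop() {
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return !items_.empty(); });
        T value = std::move(items_.front());
        items_.pop_front();
        return value;
    }

private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::deque<T> items_;
};

// Pushes 0..count-1 from the calling thread and returns the values in the
// order the consumer thread received them.
std::vector<std::int64_t> run_notifying_demo(int count) {
    NotifyingQueue<std::int64_t> queue;
    std::vector<std::int64_t> received;
    std::thread consumer([&] {
        for (int i = 0; i < count; ++i) received.push_back(queue.pop());
    });
    for (int i = 0; i < count; ++i) queue.push(i);
    consumer.join();  // received is only read after the consumer finished
    return received;
}
```

This sketch trades the lock-free ring buffer for a mutex-protected std::deque to keep it short. Whether blocking on a condition variable actually gives lower per-item latency than polling on a given machine is something that would have to be measured.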
Addressing your bullet points:
How good is the test app: I do not think it makes much sense.
- Having scheduledAt in an atomic is required, as you write it from one thread and read it from another. Otherwise you have UB.
- Obviously, any measurement difference with respect to this is purely measurement error and says nothing about the inherent latency. (You could try putting an aggregate struct {int val; int64_t time; }; into the queue, thereby avoiding the atomic fence.)
Current industry best time: no clue. Not sure anyone cares about this. (Maybe inside some kernel stuff?)
Choice of spsc_queue: I don't think it is a good choice, because it requires polling.
Faster than spsc_queue?: See above. Use non-polling notification.
Write code that does the same work significantly faster?: No. Or rather, I won't. =>
To quote "man"'s answer:
- you define the problem and select an appropriate synchronization mechanism
The problem with your question is that there is no problem definition.
As far as I am concerned, in the context of a user-land process on a regular OS, cross-thread notification latency seems utterly irrelevant. What is your use case?
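The aggregate suggested in the answer (a payload plus its enqueue timestamp travelling through the queue together) can be sketched as follows. SpscRing, Sample, now_ns and measure_delays are hypothetical names for this illustration, and the ring buffer is a deliberately minimal stand-in, not Boost's spsc_queue implementation. Note that the consumer here still polls, so the sketch illustrates only the aggregate and the fixed-size ring buffer, not the non-polling advice.

```cpp
#include <array>
#include <atomic>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// The aggregate from the answer: payload plus enqueue timestamp.
struct Sample {
    int val;
    std::int64_t time;  // nanoseconds taken from steady_clock
};

// Hypothetical minimal single-producer single-consumer ring buffer.
// Holds at most Capacity - 1 items; one slot stays empty so that
// "full" and "empty" can be told apart.
template <typename T, std::size_t Capacity>
class SpscRing {
public:
    bool push(const T& item) {  // call from the producer thread only
        const std::size_t head = head_.load(std::memory_order_relaxed);
        const std::size_t next = (head + 1) % Capacity;
        if (next == tail_.load(std::memory_order_acquire)) return false;  // full
        slots_[head] = item;
        head_.store(next, std::memory_order_release);  // publish the item
        return true;
    }

    bool pop(T& out) {  // call from the consumer thread only
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        if (tail == head_.load(std::memory_order_acquire)) return false;  // empty
        out = slots_[tail];
        tail_.store((tail + 1) % Capacity, std::memory_order_release);  // free the slot
        return true;
    }

private:
    std::array<T, Capacity> slots_{};
    std::atomic<std::size_t> head_{0};  // next slot to write
    std::atomic<std::size_t> tail_{0};  // next slot to read
};

inline std::int64_t now_ns() {
    using namespace std::chrono;
    return duration_cast<nanoseconds>(steady_clock::now().time_since_epoch()).count();
}

// Sends `count` samples through the ring and returns the per-item
// push-to-pop delay in nanoseconds, in arrival order.
std::vector<std::int64_t> measure_delays(int count) {
    SpscRing<Sample, 1024> ring;
    std::vector<std::int64_t> delays;
    delays.reserve(count);
    std::thread consumer([&] {
        Sample s;
        while (static_cast<int>(delays.size()) < count) {
            if (ring.pop(s)) delays.push_back(now_ns() - s.time);
        }
    });
    for (int i = 0; i < count; ++i) {
        while (!ring.push(Sample{i, now_ns()})) {}  // spin while the ring is full
    }
    consumer.join();
    return delays;
}
```

Because the timestamp is copied into the slot before the release store on head_, and the consumer reads it only after the matching acquire load, the timestamp needs no separate atomic of its own. The delays returned include the cost of the clock calls themselves, so they are an upper bound on the hand-off time rather than an exact figure.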