Multi-threaded random_r is slower than single threaded version

Question

The following program is essentially the same as the one described here: http://www.ellipsix.net/blog/post.29.html. When I compile and run the program using two threads (NTHREADS == 2), I get the following run times:

real        0m14.120s
user        0m25.570s
sys         0m0.050s

When it is run with just one thread (NTHREADS == 1), I get significantly better run times, even though it is using only one core.

real        0m4.705s
user        0m4.660s
sys         0m0.010s

My system is dual core. I know random_r is thread safe, and I am pretty sure it is non-blocking. When the same program is run with a calculation of cosines and sines in place of random_r, the dual-threaded version runs in about half the time, as expected.

#include <pthread.h>
#include <stdlib.h>
#include <stdio.h>

#define NTHREADS 2
#define PRNG_BUFSZ 8
#define ITERATIONS 1000000000

void* thread_run(void* arg) {
    int r1 = 0, i, totalIterations = ITERATIONS / NTHREADS;
    /* each thread draws its share of the numbers from its own PRNG state */
    for (i = 0; i < totalIterations; i++){
        random_r((struct random_data*)arg, &r1);
    }
    printf("%i\n", r1);
    return NULL;
}

int main(int argc, char** argv) {
    struct random_data* rand_states = (struct random_data*)calloc(NTHREADS, sizeof(struct random_data));
    char* rand_statebufs = (char*)calloc(NTHREADS, PRNG_BUFSZ);
    pthread_t* thread_ids;
    int t = 0;
    thread_ids = (pthread_t*)calloc(NTHREADS, sizeof(pthread_t));
    /* create threads */
    for (t = 0; t < NTHREADS; t++) {
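        /* give each thread its own PRNG_BUFSZ-byte slice; the slices are still adjacent in memory */
        initstate_r(random(), &rand_statebufs[t * PRNG_BUFSZ], PRNG_BUFSZ, &rand_states[t]);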
        pthread_create(&thread_ids[t], NULL, &thread_run, &rand_states[t]);
    }
    for (t = 0; t < NTHREADS; t++) {
        pthread_join(thread_ids[t], NULL);
    }
    free(thread_ids);
    free(rand_states);
    free(rand_statebufs);
}

I am confused why, when generating random numbers, the two-threaded version performs much worse than the single-threaded version, considering random_r is meant to be used in multi-threaded applications.

Recommended answer

A very simple change to space the data out in memory:

struct random_data* rand_states = (struct random_data*)calloc(NTHREADS * 64, sizeof(struct random_data));
char* rand_statebufs = (char*)calloc(NTHREADS*64, PRNG_BUFSZ);
pthread_t* thread_ids;
int t = 0;
thread_ids = (pthread_t*)calloc(NTHREADS, sizeof(pthread_t));
/* create threads */
for (t = 0; t < NTHREADS; t++) {
    initstate_r(random(), &rand_statebufs[t*64], PRNG_BUFSZ, &rand_states[t*64]);
    pthread_create(&thread_ids[t], NULL, &thread_run, &rand_states[t*64]);
}

results in a much faster running time on my dual-core machine.

This confirms the suspicion the change was meant to test: you are mutating values on the same cache line from two separate threads, and so have cache contention (false sharing). Herb Sutter's 'machine architecture - what your programming language never told you' talk is worth watching if you have the time and don't know about this yet; he demonstrates false sharing starting at around 1:20.
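
To check whether two pieces of per-thread state can collide in this way, it is enough to compare which cache line each address falls on. A minimal sketch, assuming a 64-byte line (typical on x86, but not guaranteed); same_cache_line and ASSUMED_LINE_SIZE are illustrative names, not part of any library:

```c
#include <stdbool.h>
#include <stdint.h>

#define ASSUMED_LINE_SIZE 64  /* assumption: typical x86 cache line size */

/* True if both addresses fall on the same cache line, i.e. writes to them
   from different cores will contend for that line. */
bool same_cache_line(const void *a, const void *b) {
    return (uintptr_t)a / ASSUMED_LINE_SIZE == (uintptr_t)b / ASSUMED_LINE_SIZE;
}
```

Two adjacent 8-byte state buffers will usually share a line by this test, whereas buffers spaced 64 bytes apart never do.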

Work out your cache line size, and create each thread's data so it is aligned to it.
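
One way to work out the line size at run time on Linux with glibc is sysconf with _SC_LEVEL1_DCACHE_LINESIZE (the same value that `getconf LEVEL1_DCACHE_LINESIZE` reports on the command line). The sketch below is not from the original answer; detect_cache_line_size and the 64-byte fallback are assumptions for illustration, and the query can report 0 or be missing on other platforms:

```c
#include <unistd.h>

#define FALLBACK_LINE_SIZE 64  /* assumption used when the query is unsupported */

/* Returns the L1 data cache line size in bytes, or the fallback if the
   system cannot report it. */
long detect_cache_line_size(void) {
#ifdef _SC_LEVEL1_DCACHE_LINESIZE
    long size = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);  /* glibc extension */
    if (size > 0)
        return size;
#endif
    return FALLBACK_LINE_SIZE;
}
```

Since the struct padding below has to be a compile-time constant, a value found this way is mainly useful for checking that the chosen CACHE_LINE_SIZE matches the machine.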

It's a bit cleaner to plonk all the thread's data into a struct, and align that:

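/* Reuses the includes, NTHREADS, PRNG_BUFSZ and thread_run from the question. */
#include <string.h> /* memset */

#define CACHE_LINE_SIZE 64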

struct thread_data {
    struct random_data random_data;
    char statebuf[PRNG_BUFSZ];
    char padding[CACHE_LINE_SIZE - sizeof ( struct random_data )-PRNG_BUFSZ];
};

int main ( int argc, char** argv )
{
    printf ( "%zu\n", sizeof ( struct thread_data ) );

    void* apointer;

    /* align the whole block to a cache line; the alignment must be a power of two */
    if ( posix_memalign ( &apointer, CACHE_LINE_SIZE, NTHREADS * sizeof ( struct thread_data ) ) )
        exit ( 1 );

    struct thread_data* thread_states = apointer;

    memset ( apointer, 0, NTHREADS * sizeof ( struct thread_data ) );

    pthread_t* thread_ids;

    int t = 0;

    thread_ids = ( pthread_t* ) calloc ( NTHREADS, sizeof ( pthread_t ) );

    /* create threads */
    for ( t = 0; t < NTHREADS; t++ ) {
        initstate_r ( random(), thread_states[t].statebuf, PRNG_BUFSZ, &thread_states[t].random_data );
        pthread_create ( &thread_ids[t], NULL, &thread_run, &thread_states[t].random_data );
    }

    for ( t = 0; t < NTHREADS; t++ ) {
        pthread_join ( thread_ids[t], NULL );
    }

    free ( thread_ids );
    free ( thread_states );
}

With CACHE_LINE_SIZE 64:

refugio:$ gcc -O3 -o bin/nixuz_random_r src/nixuz_random_r.c -lpthread
refugio:$ time bin/nixuz_random_r 
64
63499495
944240966

real    0m1.278s
user    0m2.540s
sys     0m0.000s

Or you can use double the cache line size and use malloc: the extra padding ensures the mutated memory is on separate lines, as malloc is 16-byte (IIRC) rather than 64-byte aligned.
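
The double-padding variant might look like the following. This is an illustration rather than code from the answer: padded_thread_data and alloc_padded_states are hypothetical names, the 64-byte line size is assumed, and struct random_data is the glibc type used above. Each element is 128 bytes and only its first bytes are mutated, so the states of two neighbouring threads are more than one line apart wherever the allocator places the block:

```c
#define _DEFAULT_SOURCE  /* exposes struct random_data in glibc */
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define CACHE_LINE_SIZE 64  /* assumption: check your CPU */
#define PRNG_BUFSZ 8

/* Per-thread PRNG state padded out to two cache lines. */
struct padded_thread_data {
    struct random_data random_data;
    char statebuf[PRNG_BUFSZ];
    char padding[2 * CACHE_LINE_SIZE - sizeof(struct random_data) - PRNG_BUFSZ];
};

/* Fails to compile if the compiler lays the struct out differently. */
static_assert(sizeof(struct padded_thread_data) == 2 * CACHE_LINE_SIZE,
              "padded_thread_data should span exactly two cache lines");

/* Zeroed state for nthreads threads; no special alignment is requested. */
struct padded_thread_data *alloc_padded_states(size_t nthreads) {
    return calloc(nthreads, sizeof(struct padded_thread_data));
}
```

The cost is one wasted cache line per thread, in exchange for not needing posix_memalign.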

(I reduced ITERATIONS by a factor of ten for these timings; I don't have a stupidly fast machine.)
