为什么这个程序的多线程版本更慢? [英] Why is the multithreaded version of this program slower?

查看:42
本文介绍了为什么这个程序的多线程版本更慢?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在尝试学习 pthreads,并且我一直在试验一个程序,该程序试图检测数组上的更改.函数 array_modifier() 选择一个随机元素并切换它的值(1 到 0,反之亦然)然后休眠一段时间(足够大所以不会出现竞争条件,我知道这是不好的做法).change_detector() 扫描数组,当一个元素与它的先前值不匹配且等于 1 时,检测到变化,并且 diff 数组用检测更新延迟.

I am trying to learn pthreads and I have been experimenting with a program that tries to detect the changes on an array. Function array_modifier() picks a random element and toggles it's value (1 to 0 and vice versa) and then sleeps for some time (big enough so race conditions do not appear, I know this is bad practice). change_detector() scans the array and when an element doesn't match it's prior value and it is equal to 1, the change is detected and diff array is updated with the detection delay.

当有一个 change_detector() 线程 (NTHREADS==1) 时,它必须扫描整个数组.当有更多线程时,每个线程都会分配到数组的一部分.每个检测器线程只会捕获其数组中其部分的修改,因此您需要将所有 4 个线程的捕获时间相加以获得捕获所有更改的总时间.

When there is one change_detector() thread (NTHREADS==1) it has to scan the whole array. When there are more threads each is assigned a portion of the array. Each detector thread will only catch the modifications in its part of the array, so you need to sum the catch times of all 4 threads to get the total time to catch all changes.

代码如下:

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <sys/time.h>
#include <time.h>

#define TIME_INTERVAL 100
#define CHANGES 5000

#define UNUSED(x) ((void) x)

typedef struct {
    unsigned int tid;
} parm;

static volatile unsigned int* my_array;
static unsigned int* old_value;
static struct timeval* time_array;
static unsigned int N;

static unsigned long int diff[NTHREADS] = {0};

void* array_modifier(void* args);
void* change_detector(void* arg);

int main(int argc, char** argv) {
    if (argc < 2) {
        exit(1);
    }

    N = (unsigned int)strtoul(argv[1], NULL, 0);

    my_array = calloc(N, sizeof(int));
    time_array = malloc(N * sizeof(struct timeval));
    old_value = calloc(N, sizeof(int));

    parm* p = malloc(NTHREADS * sizeof(parm));
    pthread_t generator_thread;
    pthread_t* detector_thread = malloc(NTHREADS * sizeof(pthread_t));

    for (unsigned int i = 0; i < NTHREADS; i++) {
        p[i].tid = i;
        pthread_create(&detector_thread[i], NULL, change_detector, (void*) &p[i]);
    }

    pthread_create(&generator_thread, NULL, array_modifier, NULL);

    pthread_join(generator_thread, NULL);

    usleep(500);

    for (unsigned int i = 0; i < NTHREADS; i++) {
        pthread_cancel(detector_thread[i]);
    }

    for (unsigned int i = 0; i < NTHREADS; i++) fprintf(stderr, "%lu ", diff[i]);
    fprintf(stderr, "\n");
    _exit(0);
}


void* array_modifier(void* arg) {
    UNUSED(arg);
    srand(time(NULL));

    unsigned int changing_signals = CHANGES;

    while (changing_signals--) {
        usleep(TIME_INTERVAL);
        const unsigned int r = rand() % N;

        gettimeofday(&time_array[r], NULL);
        my_array[r] ^= 1;
    }

    pthread_exit(NULL);
}

void* change_detector(void* arg) {
    const parm* p = (parm*) arg;
    const unsigned int tid = p->tid;
    const unsigned int start = tid * (N / NTHREADS) +
                               (tid < N % NTHREADS ? tid : N % NTHREADS);
    const unsigned int end = start + (N / NTHREADS) +
                             (tid < N % NTHREADS);
    unsigned int r = start;

    while (1) {
        unsigned int tmp;
        while ((tmp = my_array[r]) == old_value[r]) {
            r = (r < end - 1) ? r + 1 : start;
        }

        old_value[r] = tmp;
        if (tmp) {
            struct timeval tv;
            gettimeofday(&tv, NULL);
            // detection time in usec
            diff[tid] += (tv.tv_sec - time_array[r].tv_sec) * 1000000 + (tv.tv_usec - time_array[r].tv_usec);
        }
    }
}

当我编译 &像这样运行:

when I compile & run like this:

gcc -Wall -Wextra -O3 -DNTHREADS=1 file.c -pthread && ./a.out 100

我明白了:

665

但是当我编译 &像这样运行:

but when I compile & run like this:

gcc -Wall -Wextra -O3 -DNTHREADS=4 file.c -pthread && ./a.out 100

我得到:

152 190 164 242

(总和为 748).

因此,多线程程序的延迟更大.

So, the delay for the multithreaded program is larger.

我的 CPU 有 6 个内核.

My cpu has 6 cores.

推荐答案

简短回答您在线程之间共享内存,线程之间共享内存很慢.

Short Answer You are sharing memory between thread and sharing memory between threads is slow.

长答案您的程序正在使用多个线程写入 my_array 和另一个线程从 my_array 读取.有效地 my_array 由多个线程共享.

Long Answer Your program is using a number of thread to write to my_array and another thread to read from my_array. Effectively my_array is shared by a number of threads.

现在假设您在多核机器上进行基准测试,您可能希望操作系统为每个线程分配不同的内核.

Now lets assume you are benchmarking on a multicore machine, you probably are hoping that the OS will assign different cores to each thread.

请记住,在现代处理器上写入 RAM 非常昂贵(数百个 CPU 周期).为了提高性能,CPU 具有多级缓存.最快的缓存是小型 L1 缓存.内核可以以 2-3 个周期的顺序写入其 L1 缓存.L2 缓存可能需要 20 - 30 个周期.

Bear in mind that on modern processors writing to RAM is really expensive (hundreds of CPU cycles). To improve performance CPUs have multi-level caches. The fastest Cache is the small L1 cache. A core can write to its L1 cache in the order of 2-3 cycles. The L2 cache may take on the order of 20 - 30 cycles.

现在在许多 CPU 架构中,每个内核都有自己的 L1 缓存,但 L2 缓存是共享的.这意味着在线程(核心)之间共享的任何数据都必须通过 L2 缓存,这比 L1 缓存慢得多.这意味着共享内存访问往往很慢.

Now in lots of CPU architectures each core has its own L1 cache but the L2 cache is shared. This means any data that is shared between thread (cores) has to go through the L2 cache which is much slower than the L1 cache. This means that shared memory access tends to be quite slow.

最重要的是,如果您希望多线程程序运行良好,您需要确保线程不共享内存.共享内存很慢.

Bottom line is that if you want your multithreaded programs to perform well you need to ensure that threads do not share memory. Sharing memory is slow.

旁白在线程之间共享内存时,永远不要依赖 volatile 来做正确的事情,要么使用库原子操作,要么使用互斥锁.这是因为某些 CPU 允许乱序读取和写入,如果您不知道自己在做什么,这可能会导致奇怪的事情.

Aside Never rely on volatile to do the correct thing when sharing memory between thread, either use your library atomic operations or use mutexes. This is because some CPUs allow out of order reads and writes that may do strange things if you do not know what you are doing.

这篇关于为什么这个程序的多线程版本更慢?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆