Running multiple Node.js worker threads: why such a large overhead/latency?


Question

I'm experimenting with Node.js worker threads, and I'm seeing unexpectedly long elapsed times.

  • I have a main.js that spawns a rapid sequence of worker threads.
  • Each worker.js executes a CPU-bound computation (generating prime numbers).

BTW, the generatePrimes() JavaScript function is just a demo example of a CPU-bound calculation. In my real case, the worker thread is a Node.js program that binds a C++ library (doing speech recognition, which takes about half a second of elapsed time at 100% CPU).

    • My laptop: Ubuntu 20.04.2 LTS desktop environment, with 8 logical CPUs (a quad-core i7-8565U with multithreading, per inxi):

      $ inxi -C -M
      Machine:   Type: Laptop System: HP product: HP Laptop 17-by1xxx v: Type1ProductConfigId serial: <superuser/root required> 
                 Mobo: HP model: 8531 v: 17.16 serial: <superuser/root required> UEFI: Insyde v: F.32 date: 12/14/2018 
      CPU:       Topology: Quad Core model: Intel Core i7-8565U bits: 64 type: MT MCP L2 cache: 8192 KiB Speed: 700 MHz min/max: 400/4600 MHz Core speeds (MHz): 1: 700 2: 700 3: 700 4: 700 5: 700 6: 700 7: 700 8: 700     
      

      $ echo "CPU threads: $(grep -c processor /proc/cpuinfo)"
      CPU threads: 8
      

    • When worker.js runs on its own (single thread) calling generatePrimes(2, 1e7), the computation takes ~8 seconds of total elapsed time.

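      The problem

      When I spawn multiple threads, e.g. 6 threads, almost in parallel (see the code below), I expected the elapsed time to still be about 8 seconds (maybe with a small overhead), regardless of the number of spawned threads (they run in parallel, and there are enough CPU cores, aren't there?). Instead, the total elapsed time is far greater than the expected ~8 seconds: it adds up to more than ~20 seconds. Why?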

      Below are the source code and some elapsed-time measurements taken with time/pidstat:

      main.js

      // main.js
      const { Worker } = require('worker_threads')

      // Run worker.js in a new thread; resolve with the first message it posts.
      function runThread(workerData) {
        return new Promise((resolve, reject) => {
          const worker = new Worker('./worker.js', { workerData })

          worker.on('message', resolve)
          worker.on('error', reject)
          worker.on('exit', (code) => {
            if (code !== 0)
              reject(new Error(`Worker stopped with exit code ${code}`))
          })
        })
      }

      async function main() {
        const numThreads = +process.argv[2]

        if (!numThreads || numThreads < 1) {
          console.error(`usage: ${process.argv[1]} number_of_threads`)
          process.exit(1)
        }

        const min = 2
        const max = 1e7

        // Spawn numThreads workers in rapid succession ("in parallel").
        // Each worker runs the same computation and reports back only a
        // short "Done" string rather than the computed data.
        for (let i = 0; i < numThreads; i++)
          setImmediate(async () => { console.log(await runThread({ min, max })) })
      }

      if (require.main === module)
        main()

      module.exports = { runThread }
          
      


      worker.js

      // worker.js
      const { threadId, workerData, parentPort } = require('worker_threads')
      const { generatePrimes } = require('./generatePrimes')
      
      // take parameters from the main/parent thread
      const { min, max } = workerData
      
      // synchronous long-running CPU-bound computation
      const primes = generatePrimes(min, max)
      
      // communicate the result to the main thread;
      // to rule out any suspicion that elapsed times depend on a large data exchange
      // (the primes array in this case), the returned data is just a short string.
      parentPort.postMessage( `Done. Thread id: ${threadId}` )
      


      generatePrimes.js

      // generatePrimes.js
      // long running / CPU-bound calculation
      
      function generatePrimes(start, range) {
        
        const primes = []
        let isPrime = true
        let end = start + range
        
        for (let i = start; i < end; i++) {
          for (let j = start; j < Math.sqrt(end); j++) {
            if (i !== j && i%j === 0) {
              isPrime = false
              break
            }
          }
          if (isPrime) {
            primes.push(i)
          }
          isPrime = true
        }
      
        return primes
      }
      
      
      function main() {
      
        const min = 2
        const max = 1e7
      
        console.log( generatePrimes(min, max) )
      
      }  
      
      
      if (require.main === module) 
        main()
      
      module.exports = { generatePrimes }
      


      Tests

      • Test 1: no worker threads -> elapsed: ~8 seconds
      • Test 2: spawn 1 thread -> elapsed: ~8 seconds
      • Test 3: spawn 6 threads -> elapsed: ~21 seconds

      Test 1: no worker threads

      generatePrimes.js standalone -> elapsed: ~8 seconds

      $ /usr/bin/time -f "%E" pidstat 1 -u -e node generatePrimes
      Linux 5.8.0-50-generic (giorgio-HP-Laptop-17-by1xxx)    22/04/2021  _x86_64_    (8 CPU)
      
      09:19:05      UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
      09:19:06     1000    247776   98,02    0,00    0,00    0,00   98,02     5  node
      09:19:07     1000    247776  100,00    0,00    0,00    0,00  100,00     5  node
      09:19:08     1000    247776  100,00    0,00    0,00    0,00  100,00     5  node
      09:19:09     1000    247776  100,00    0,00    0,00    0,00  100,00     5  node
      09:19:10     1000    247776  100,00    0,00    0,00    0,00  100,00     5  node
      09:19:11     1000    247776  100,00    0,00    0,00    0,00  100,00     5  node
      09:19:12     1000    247776  100,00    0,00    0,00    0,00  100,00     5  node
      09:19:13     1000    247776  100,00    0,00    0,00    0,00  100,00     5  node
      [
          2,   3,   5,   7,  11,  13,  17,  19,  23,  29,  31,  37,
         41,  43,  47,  53,  59,  61,  67,  71,  73,  79,  83,  89,
         97, 101, 103, 107, 109, 113, 127, 131, 137, 139, 149, 151,
        157, 163, 167, 173, 179, 181, 191, 193, 197, 199, 211, 223,
        227, 229, 233, 239, 241, 251, 257, 263, 269, 271, 277, 281,
        283, 293, 307, 311, 313, 317, 331, 337, 347, 349, 353, 359,
        367, 373, 379, 383, 389, 397, 401, 409, 419, 421, 431, 433,
        439, 443, 449, 457, 461, 463, 467, 479, 487, 491, 499, 503,
        509, 521, 523, 541,
        ... 664479 more items
      ]
      
      Average:     1000    247776   99,75    0,00    0,00    0,00   99,75     -  node
      0:08.60
      

      Test 2: spawn 1 thread

      main.js spawns 1 thread -> elapsed: ~8 seconds (again)

      $ /usr/bin/time -f "%E" pidstat 1 -u -e node main 1
      Linux 5.8.0-50-generic (giorgio-HP-Laptop-17-by1xxx)    22/04/2021  _x86_64_    (8 CPU)
      
      your machine has 8 cores.
      
      
      09:21:01      UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
      09:21:02     1000    247867   95,00    2,00    0,00    0,00   97,00     3  node
      09:21:03     1000    247867  100,00    0,00    0,00    0,00  100,00     3  node
      09:21:04     1000    247867  100,00    0,00    0,00    0,00  100,00     3  node
      09:21:05     1000    247867  100,00    0,00    0,00    0,00  100,00     3  node
      09:21:06     1000    247867  100,00    0,00    0,00    0,00  100,00     3  node
      09:21:07     1000    247867  100,00    0,00    0,00    0,00  100,00     3  node
      09:21:08     1000    247867  100,00    0,00    0,00    0,00  100,00     3  node
      09:21:09     1000    247867  100,00    1,00    0,00    0,00  101,00     3  node
      Done. Thread id: 1
      
      Average:     1000    247867   99,38    0,38    0,00    0,00   99,75     -  node
      0:08.50
      

      Test 3: spawn 6 threads

      main.js spawns multiple (6) threads -> elapsed: ~21 seconds

      $ /usr/bin/time -f "%E" pidstat 1 -u -e node main 6
      Linux 5.8.0-50-generic (giorgio-HP-Laptop-17-by1xxx)    22/04/2021  _x86_64_    (8 CPU)
      
      your machine has 8 cores.
      
      
      09:23:38      UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
      09:23:39     1000    247946  554,00    1,00    0,00    0,00  555,00     0  node
      09:23:40     1000    247946  599,00    1,00    0,00    0,00  600,00     0  node
      09:23:41     1000    247946  600,00    1,00    0,00    0,00  601,00     0  node
      09:23:42     1000    247946  599,00    0,00    0,00    0,00  599,00     0  node
      09:23:43     1000    247946  599,00    1,00    0,00    0,00  600,00     0  node
      09:23:44     1000    247946  599,00    0,00    0,00    0,00  599,00     0  node
      09:23:45     1000    247946  600,00    0,00    0,00    0,00  600,00     0  node
      09:23:46     1000    247946  599,00    2,00    0,00    0,00  601,00     0  node
      09:23:47     1000    247946  599,00    0,00    0,00    0,00  599,00     0  node
      09:23:48     1000    247946  599,00    0,00    0,00    0,00  599,00     0  node
      09:23:49     1000    247946  600,00    1,00    0,00    0,00  601,00     0  node
      09:23:50     1000    247946  598,00    1,00    0,00    0,00  599,00     0  node
      09:23:51     1000    247946  599,00    2,00    0,00    0,00  601,00     0  node
      Done. Thread id: 1
      Done. Thread id: 4
      09:23:52     1000    247946  430,00    0,00    0,00    0,00  430,00     0  node
      09:23:53     1000    247946  398,00    0,00    0,00    0,00  398,00     0  node
      09:23:54     1000    247946  399,00    1,00    0,00    0,00  400,00     0  node
      09:23:55     1000    247946  398,00    0,00    0,00    0,00  398,00     0  node
      09:23:56     1000    247946  399,00    0,00    0,00    0,00  399,00     0  node
      09:23:57     1000    247946  396,00    3,00    0,00    0,00  399,00     0  node
      09:23:58     1000    247946  399,00    0,00    0,00    0,00  399,00     0  node
      Done. Thread id: 5
      Done. Thread id: 6
      09:23:59     1000    247946  399,00    1,00    0,00    0,00  400,00     7  node
      Done. Thread id: 2
      Done. Thread id: 3
      
      Average:     1000    247946  522,00    0,71    0,00    0,00  522,71     -  node
      0:21.05
      


      Why do I get ~20 seconds instead of the expected ~8 seconds? Where am I going wrong?

      Update

      • I moved the CPU-bound function generatePrimes into a separate module, just for clarity.

      I added more timing tests, increasing the number of threads from 1 to 9. They show that the elapsed time increases with the number of spawned threads. That makes no sense to me :(

      $ /usr/bin/time -f "%E" node main 1
      
      your machine has 8 cores.
      
      Done. Thread id: 1
      0:08.86
      $ /usr/bin/time -f "%E" node main 2
      
      your machine has 8 cores.
      
      Done. Thread id: 2
      Done. Thread id: 1
      0:13.96
      $ /usr/bin/time -f "%E" node main 3
      
      your machine has 8 cores.
      
      Done. Thread id: 2
      Done. Thread id: 1
      Done. Thread id: 3
      0:16.71
      $ /usr/bin/time -f "%E" node main 4
      
      your machine has 8 cores.
      
      Done. Thread id: 3
      Done. Thread id: 2
      Done. Thread id: 4
      Done. Thread id: 1
      0:21.87
      $ /usr/bin/time -f "%E" node main 5
      
      your machine has 8 cores.
      
      Done. Thread id: 3
      Done. Thread id: 2
      Done. Thread id: 5
      Done. Thread id: 1
      Done. Thread id: 4
      0:22.20
      $ /usr/bin/time -f "%E" node main 6
      
      your machine has 8 cores.
      
      Done. Thread id: 3
      Done. Thread id: 4
      Done. Thread id: 6
      Done. Thread id: 2
      Done. Thread id: 5
      Done. Thread id: 1
      0:23.74
      $ /usr/bin/time -f "%E" node main 7
      
      your machine has 8 cores.
      
      Done. Thread id: 3
      Done. Thread id: 4
      Done. Thread id: 7
      Done. Thread id: 2
      Done. Thread id: 5
      Done. Thread id: 1
      Done. Thread id: 6
      0:32.00
      $ /usr/bin/time -f "%E" node main 8
      
      your machine has 8 cores.
      
      Done. Thread id: 6
      Done. Thread id: 3
      Done. Thread id: 2
      Done. Thread id: 5
      Done. Thread id: 1
      Done. Thread id: 8
      Done. Thread id: 7
      Done. Thread id: 4
      0:35.92
      $ /usr/bin/time -f "%E" node main 9
      
      your machine has 8 cores.
      
      warning: number of requested threads (9) is higher than number of available cores (8)
      Done. Thread id: 8
      Done. Thread id: 4
      Done. Thread id: 6
      Done. Thread id: 9
      Done. Thread id: 2
      Done. Thread id: 3
      Done. Thread id: 7
      Done. Thread id: 5
      Done. Thread id: 1
      0:40.27
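
As a rough way to read the runs above, the elapsed times can be turned into a slowdown factor and an aggregate throughput relative to the single-worker run. This is only a sketch: the numbers are copied from the `time` output above, while the names `elapsedByThreads` and `scalingTable` are made up for illustration.

```javascript
// Wall-clock seconds copied from the `/usr/bin/time node main N` runs above.
const elapsedByThreads = {
  1: 8.86, 2: 13.96, 3: 16.71, 4: 21.87, 5: 22.20,
  6: 23.74, 7: 32.00, 8: 35.92, 9: 40.27,
};

// Every worker performs the same fixed job, so N workers complete N jobs.
// (scalingTable is a hypothetical helper, not part of the question's code.)
function scalingTable(elapsed) {
  const single = elapsed[1];
  return Object.keys(elapsed)
    .map(Number)
    .sort((a, b) => a - b)
    .map((threads) => {
      // How many times longer the run took than the single-worker run.
      const slowdown = elapsed[threads] / single;
      // Jobs completed per "single-worker run" of wall-clock time.
      const throughput = threads / slowdown;
      return {
        threads,
        elapsed: elapsed[threads],
        slowdown: Number(slowdown.toFixed(2)),
        throughput: Number(throughput.toFixed(2)),
      };
    });
}

console.table(scalingTable(elapsedByThreads));
```

If the workers scaled perfectly, `slowdown` would stay near 1 and `throughput` would grow in step with the thread count; the gap between that ideal and the printed table is what the question is asking about.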
      

      BTW, a related question: "Why program execution time differs running the same program multiple times?"

      Recommended answer

      You're running up against a practical limit known as Amdahl's Law. If you have twice the processors, you ordinarily don't get twice the computing throughput. Why not?

      Several reasons, typically very hard to tease out and measure separately, including:

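      1. It takes time to spin up and tear down the parallel processes (in this case, JavaScript worker threads).
      2. The processors have to coordinate with each other, which requires context switches and inter-processor communication.
      3. The processors share resources such as the bus, RAM, and SSD/hard-drive access. One processor sometimes has to wait while another uses those resources.
      4. (I'm not completely sure about this one): The V8 JavaScript engine needs to spend some time in each worker thread detecting and just-in-time compiling the "hot path" in the code. In this case, that operation is memory-intensive for the worker. After that, code like yours probably runs mostly in processor registers, which comes close to defying Amdahl's Law.
      5. There's a lot of other stuff running on a laptop.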
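
For reference, Amdahl's Law is usually written as speedup = 1 / ((1 - p) + p / n), where p is the fraction of the work that can run in parallel and n is the number of processors. A minimal sketch follows; the function name `amdahlSpeedup` and the sample value p = 0.9 are chosen purely for illustration and are not measurements from the question.

```javascript
// Best-case speedup predicted by Amdahl's Law.
// p: fraction of the work that can be parallelized (0..1)
// n: number of processors (>= 1)
// (amdahlSpeedup is a hypothetical helper name used for this sketch.)
function amdahlSpeedup(p, n) {
  if (!(p >= 0 && p <= 1)) throw new RangeError('p must be between 0 and 1');
  if (!(n >= 1)) throw new RangeError('n must be at least 1');
  return 1 / ((1 - p) + p / n);
}

// Illustrative only: assume 90% of the work parallelizes.
for (const n of [1, 2, 4, 8]) {
  console.log(`${n} processors -> speedup ${amdahlSpeedup(0.9, n).toFixed(2)}`);
}
```

The formula only covers the serial fraction of the work; it does not model the contention and scheduling effects from the list above, which push real results further below the prediction.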

      And laptops don't have the same power-hungry RAM and bus structures as monster multicore servers, so the contention is worse. They're designed more for desktop use cases, where various user-interface processes share the cores and hyperthreads.

      If you have one worker thread per real core, your main Node.js process has to share with your threads too. As does your Xorg server, your file system, and all those dozens of really useful daemon processes on Focal Fossa.
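
A hedged sketch of how a worker count might be chosen with that in mind. Node's `os.cpus()` reports logical CPUs (hardware threads), not physical cores, so the `threadsPerCore = 2` default below is an assumption about hyper-threading that has to be checked per machine, and `pickWorkerCount` and `reservedCores` are hypothetical names.

```javascript
const os = require('os');

// Estimate how many CPU-bound workers to spawn.
// logicalCpus:   number of logical CPUs, e.g. os.cpus().length
// threadsPerCore: ASSUMPTION, 2 on typical hyper-threaded CPUs; Node does not report it
// reservedCores: cores left free for the main thread, the desktop, daemons, etc.
function pickWorkerCount(logicalCpus, threadsPerCore = 2, reservedCores = 1) {
  const physicalCores = Math.max(1, Math.floor(logicalCpus / threadsPerCore));
  return Math.max(1, physicalCores - reservedCores);
}

console.log(`logical CPUs: ${os.cpus().length}`);
console.log(`suggested workers: ${pickWorkerCount(os.cpus().length)}`);
```

On the laptop from the question (8 logical CPUs on a quad-core chip), these defaults would suggest 3 workers. Whether that is the best setting for the real workload can only be settled by measuring.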

      If this is a critical issue for your capacity planning, spend a few tens of euros/dollars and rent a big fat 24- or 32-core server from one of the cloud vendors to run your experiments, using your real signal-processing workload. That's a more useful test. If you rent the most cores they offer, you'll likely get the whole machine and not share it with other customers.

      Don't waste your time trying to understand the money- and power-saving shortcuts and sleazy hardware hacks in your laptop motherboard.

      (This old-timer once worked in field software support for a computer company. I had to explain Amdahl's Law to the salespeople, over and over, so they wouldn't oversell the company's ridiculously expensive new parallel-processing products. They still did oversell them. It took some big customers demanding their money back to teach them.)
