Runtime performance degradation for C FFI Callback when pthreads are enabled


Problem description

I am curious about how the GHC runtime behaves with the -threaded option when C code calls back into a Haskell function through the FFI. I wrote code to measure the overhead of a basic function callback (below). While callback overhead has been discussed before, I am curious about the sharp increase in total time I observed when multi-threading is enabled in the C code, even though the total number of calls into Haskell stays the same. In my test, I called the Haskell function f 5M times in two scenarios (GHC 7.0.4, RHEL, 12-core box, runtime options given after the code):

  • Single thread in C create_threads function: call f 5M times - Total time 1.32s

  • 5 threads in C create_threads function: each thread calls f 1M times - so, total is still 5M - Total time 7.79s

The code is below. The Haskell code is set up for the single-threaded C callback test; comments explain how to change it for the 5-thread test:

t.hs:

{-# LANGUAGE BangPatterns #-}
import qualified Data.Vector.Storable as SV
import Control.Monad (mapM, mapM_)
import Foreign.Ptr (Ptr, FunPtr, freeHaskellFunPtr)
import Foreign.C.Types (CInt)

f :: CInt -> ()
f x = ()

-- "wrapper" import is a converter for converting a Haskell function to a foreign function pointer
foreign import ccall "wrapper"
  wrap :: (CInt -> ()) -> IO (FunPtr (CInt -> ()))

foreign import ccall safe "mt.h create_threads"
  createThreads :: Ptr (FunPtr (CInt -> ())) -> Ptr CInt -> CInt -> IO()

main = do
  -- set threads=[1..5], l=1000000 for multi-threaded FFI callback testing
  let threads = [1..1]
      l = 5000000
      vl = SV.replicate (length threads) (fromIntegral l) -- make a vector of l
  lf <- mapM (\x -> wrap f ) threads -- wrap f into a funPtr and create a list
  let vf = SV.fromList lf -- create vector of FunPtr to f
  -- pass vector of function pointer to f, and vector of l to create_threads
  -- create_threads will spawn threads (equal to length of threads list)
  -- each pthread will call back f l times - then we can check the overhead
  SV.unsafeWith vf $ \x ->
    SV.unsafeWith vl $ \y -> createThreads x y (fromIntegral $ SV.length vl)
  SV.mapM_ freeHaskellFunPtr vf

mt.h:

#include <pthread.h>
#include <stdio.h>

typedef void(*FunctionPtr)(int);

/** Struct for passing argument to thread
**
**/
typedef struct threadArgs{
   int  threadId;
   FunctionPtr fn;
   int length;
} threadArgs;


/* This is our thread function.  It is like main(), but for a thread*/
void *threadFunc(void *arg);
void create_threads(FunctionPtr*,int*,int);

mt.c:

#include "mt.h"


/* This is our thread function.  It is like main(), but for a thread*/
void *threadFunc(void *arg)
{
  threadArgs args = *(threadArgs*) arg;
  FunctionPtr fn = args.fn;
  int length = args.length;
  int i;
  for (i = 0; i < length; i++) {
    fn(i); // call the Haskell function
  }
  return NULL; // a void* thread function must return a value
}

void create_threads(FunctionPtr* fp, int* length, int numThreads )
{
  pthread_t pth[numThreads];  // this is our thread identifier
  threadArgs args[numThreads];
  int t;
  for (t=0; t < numThreads;){
    args[t].threadId = t;
    args[t].fn = *(fp + t);
    args[t].length = *(length + t);
    pthread_create(&pth[t],NULL,threadFunc,&args[t]);
    t++;
  }

  for (t=0; t < numThreads;t++){
    pthread_join(pth[t],NULL);
  }
  printf("All threads terminated\n");
}

Compilation (GHC 7.0.4, gcc 4.4.3 in case it is used by ghc):

 $ ghc -O2 t.hs mt.c -lpthread -threaded -rtsopts -optc-O2

Running with 1 thread in create_threads (the code above will do that) - I turned off parallel gc for testing:

$ ./t +RTS -s -N5 -g1
  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    1.04s  (  1.05s elapsed)
  GC    time    0.28s  (  0.28s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    1.32s  (  1.34s elapsed)

  %GC time      21.1%  (21.2% elapsed)

Running with 5 threads (see first comment in main function of t.hs above on how to edit it for 5 threads):

$ ./t +RTS -s -N5 -g1
  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    7.42s  (  2.27s elapsed)
  GC    time    0.36s  (  0.37s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    7.79s  (  2.63s elapsed)

  %GC time       4.7%  (13.9% elapsed)

I would appreciate insight into why performance degrades with multiple pthreads in create_threads. I first suspected parallel GC, but I turned it off for the tests above. With the same runtime options, the MUT time also goes up sharply for multiple pthreads (1.04s to 7.42s), so it is not just GC.

Also, are there any improvements in GHC 7.4.1 for this kind of scenario?

I don't plan to call back into Haskell from the FFI that often, but understanding the above issue helps when designing how Haskell interacts with a multi-threaded C library.

Solution

I believe the key question here is, how does the GHC runtime schedule C callbacks into Haskell? Although I don't know for certain, my suspicion is that all C callbacks are handled by the Haskell thread that originally made the foreign call, at least up to ghc-7.2.1 (which I'm using).

This would explain the large slowdown you (and I) see when moving from 1 thread to 5. If the five threads are all calling back into the same Haskell thread, there will be significant contention on that Haskell thread to complete all the callbacks.

In order to test this, I modified your code so that Haskell forks a new thread before calling create_threads, and create_threads only spawns one thread per call. If I'm correct, each OS thread will have a dedicated Haskell thread to perform work, so there should be much less contention. Although this still takes almost twice as long as the single-thread version, it's significantly faster than the original multi-threaded version, which lends some evidence to this theory. The difference is much less if I turn off thread migration with +RTS -qm.
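The modified code is not shown here, so the following is only a minimal sketch of what the Haskell side of such an experiment might look like, assuming plain forkIO threads and one MVar per thread to wait for completion. The names forkAndWait and callCreateThreads are hypothetical, and the foreign call is replaced by a counter so that the sketch runs on its own; in the real test each forked thread would call createThreads with vectors of length 1.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import Control.Exception (finally)
import Control.Monad (forM)
import Data.IORef (atomicModifyIORef, newIORef, readIORef)

-- Hypothetical helper: run `action i` in its own Haskell thread for each
-- i in [1..n], then block until every one of those threads has finished.
forkAndWait :: Int -> (Int -> IO ()) -> IO ()
forkAndWait n action = do
  dones <- forM [1 .. n] $ \i -> do
    done <- newEmptyMVar
    -- `finally` signals completion even if the action throws.
    _ <- forkIO (action i `finally` putMVar done ())
    return done
  mapM_ takeMVar dones

main :: IO ()
main = do
  counter <- newIORef (0 :: Int)
  -- Stand-in for the foreign call (an assumption for illustration).
  -- In the benchmark this would be something like
  --   \_ -> SV.unsafeWith vf1 $ \x ->
  --           SV.unsafeWith vl1 $ \y -> createThreads x y 1
  -- where vf1 and vl1 are hypothetical length-1 vectors.
  let callCreateThreads _ = atomicModifyIORef counter (\c -> (c + 1, ()))
  forkAndWait 5 callCreateThreads
  readIORef counter >>= print
```

Because createThreads is imported as a safe foreign call, a Haskell thread blocked in it should not stop the other forked threads from running under -threaded. Whether forkIO or forkOS was used in the actual experiment is not stated.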

As Daniel Fischer reports different results for ghc-7.2.2, I would expect that version changes how Haskell schedules callbacks. Maybe somebody on the ghc-users list can provide more information on this; I don't see anything likely in the release notes for 7.2.2 or 7.4.1.
