Runtime performance degradation for C FFI callback when pthreads are enabled
Problem description
I am curious about the behavior of the GHC runtime with the -threaded option when C code calls back into a Haskell function through the FFI. I wrote code to measure the overhead of a basic function callback (below). While function callback overhead has been discussed before, I am curious about the sharp increase in total time I observed when multi-threading is enabled in the C code, even though the total number of calls into Haskell stays the same. In my test, I called the Haskell function f 5M times in two scenarios (GHC 7.0.4, RHEL, 12-core box, runtime options given after the code):
- Single thread in the C create_threads function: call f 5M times. Total time 1.32s
- 5 threads in the C create_threads function: each thread calls f 1M times, so the total is still 5M. Total time 7.79s
The code is below. The Haskell code is set up for the single-threaded C callback; comments explain how to change it for the 5-thread test:
t.hs:
{-# LANGUAGE BangPatterns #-}
import qualified Data.Vector.Storable as SV
import Control.Monad (mapM, mapM_)
import Foreign.Ptr (Ptr, FunPtr, freeHaskellFunPtr)
import Foreign.C.Types (CInt)

f :: CInt -> ()
f x = ()

-- "wrapper" import is a converter for converting a Haskell function to a foreign function pointer
foreign import ccall "wrapper"
  wrap :: (CInt -> ()) -> IO (FunPtr (CInt -> ()))

foreign import ccall safe "mt.h create_threads"
  createThreads :: Ptr (FunPtr (CInt -> ())) -> Ptr CInt -> CInt -> IO ()

main = do
  -- set threads=[1..5], l=1000000 for multi-threaded FFI callback testing
  let threads = [1..1]
      l = 5000000
      vl = SV.replicate (length threads) (fromIntegral l) -- make a vector of l
  lf <- mapM (\x -> wrap f) threads -- wrap f into a funPtr and create a list
  let vf = SV.fromList lf -- create vector of FunPtr to f

  -- pass vector of function pointer to f, and vector of l to create_threads
  -- create_threads will spawn threads (equal to length of threads list)
  -- each pthread will call back f l times - then we can check the overhead
  SV.unsafeWith vf $ \x ->
    SV.unsafeWith vl $ \y -> createThreads x y (fromIntegral $ SV.length vl)
  SV.mapM_ freeHaskellFunPtr vf
mt.h:
#include <pthread.h>
#include <stdio.h>

typedef void (*FunctionPtr)(int);

/** Struct for passing argument to thread
**
**/
typedef struct threadArgs {
    int threadId;
    FunctionPtr fn;
    int length;
} threadArgs;

/* This is our thread function. It is like main(), but for a thread */
void *threadFunc(void *arg);
void create_threads(FunctionPtr *, int *, int);
mt.c:
#include "mt.h"

/* This is our thread function. It is like main(), but for a thread */
void *threadFunc(void *arg)
{
    FunctionPtr fn;
    threadArgs args = *(threadArgs *) arg;
    int id = args.threadId;
    int length = args.length;
    fn = args.fn;
    int i;
    for (i = 0; i < length;) {
        fn(i++); // call haskell function
    }
    return NULL; // a void * thread function must return a value
}

void create_threads(FunctionPtr *fp, int *length, int numThreads)
{
    pthread_t pth[numThreads]; // this is our thread identifier
    threadArgs args[numThreads];
    int t;
    for (t = 0; t < numThreads;) {
        args[t].threadId = t;
        args[t].fn = *(fp + t);
        args[t].length = *(length + t);
        pthread_create(&pth[t], NULL, threadFunc, &args[t]);
        t++;
    }
    // wait for every thread, so args[] stays valid while they run
    for (t = 0; t < numThreads; t++) {
        pthread_join(pth[t], NULL);
    }
    printf("All threads terminated\n");
}
Compilation (GHC 7.0.4, gcc 4.4.3 in case it is used by ghc):
$ ghc -O2 t.hs mt.c -lpthread -threaded -rtsopts -optc-O2
Running with 1 thread in create_threads (the code above does that); I turned off parallel GC for testing:
$ ./t +RTS -s -N5 -g1
INIT time 0.00s ( 0.00s elapsed)
MUT time 1.04s ( 1.05s elapsed)
GC time 0.28s ( 0.28s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 1.32s ( 1.34s elapsed)
%GC time 21.1% (21.2% elapsed)
Running with 5 threads (see the first comment in the main function of t.hs above for how to edit it for 5 threads):
$ ./t +RTS -s -N5 -g1
INIT time 0.00s ( 0.00s elapsed)
MUT time 7.42s ( 2.27s elapsed)
GC time 0.36s ( 0.37s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 7.79s ( 2.63s elapsed)
%GC time 4.7% (13.9% elapsed)
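To put the two reports side by side, the timings can be reduced to a cost per callback. The helpers below are only an illustrative sketch: the names ns_per_call and cpu_to_elapsed are made up for this note and are not part of the test program.

```c
#include <assert.h>

/* Average cost of one callback in nanoseconds, given a time in seconds
   (as printed by +RTS -s) and the number of callbacks made. */
double ns_per_call(double seconds, long calls)
{
    assert(calls > 0);
    return seconds * 1e9 / (double)calls;
}

/* Ratio of CPU time to elapsed time: roughly how many cores were busy. */
double cpu_to_elapsed(double cpu_seconds, double elapsed_seconds)
{
    assert(elapsed_seconds > 0.0);
    return cpu_seconds / elapsed_seconds;
}
```

With the figures above, ns_per_call(1.04, 5000000) is about 208 ns of MUT time per callback for the single-thread run. For the 5-thread run, ns_per_call(7.42, 5000000) is about 1484 ns of CPU time and ns_per_call(2.27, 5000000) is about 454 ns of elapsed time per callback, and cpu_to_elapsed(7.42, 2.27) is about 3.27. So the 5-thread run is slower per callback in both CPU time and wall-clock time, even though several cores are busy.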
I would appreciate insight into why performance degrades with multiple pthreads in create_threads. I first suspected parallel GC, but I turned it off for the tests above. With the same runtime options, the MUT time also goes up sharply with multiple pthreads, so it is not just GC.
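One way to separate the cost of the pthreads themselves from the cost of re-entering the GHC runtime would be to time the same loop with a plain C callback. The sketch below is an assumption-laden baseline, not part of the original test: c_noop, baselineArgs, and run_baseline are hypothetical names, the callback is a C stand-in for the wrapped Haskell f, and the 64-thread cap is arbitrary.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <time.h>

typedef void (*FunctionPtr)(int);

/* Per-thread call counter; volatile so the increments are not folded away. */
static _Thread_local volatile long calls_seen = 0;

/* Plain C no-op callback: an assumed stand-in for the wrapped Haskell f. */
static void c_noop(int i)
{
    (void)i;
    calls_seen++;
}

typedef struct {
    FunctionPtr fn;
    int length;
    long calls;   /* filled in by the thread before it exits */
} baselineArgs;

static void *baseline_thread(void *arg)
{
    baselineArgs *a = arg;
    for (int i = 0; i < a->length; i++)
        a->fn(i);
    a->calls = calls_seen;
    return NULL;
}

/* Runs num_threads pthreads, each calling c_noop `length` times.
   Stores the wall-clock time in *elapsed and returns the total number
   of callbacks made by the threads that were started. */
long run_baseline(int num_threads, int length, double *elapsed)
{
    assert(num_threads > 0 && num_threads <= 64);
    pthread_t pth[64];
    baselineArgs args[64];
    struct timespec t0, t1;
    int started = 0;
    long total = 0;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int t = 0; t < num_threads; t++) {
        args[t] = (baselineArgs){ c_noop, length, 0 };
        if (pthread_create(&pth[t], NULL, baseline_thread, &args[t]) != 0)
            break;
        started++;
    }
    for (int t = 0; t < started; t++) {
        pthread_join(pth[t], NULL);
        total += args[t].calls;
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    *elapsed = (double)(t1.tv_sec - t0.tv_sec)
             + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
    return total;
}
```

Comparing run_baseline(1, 5000000, &e) with run_baseline(5, 1000000, &e) gives a rough idea of how much of the slowdown could be attributed to thread creation and scheduling alone. Whatever remains in the Haskell version would then point at the runtime's handling of callbacks.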
Also, are there any improvements in GHC 7.4.1 for this kind of scenario?
I don't plan to call back into Haskell from the FFI that often, but understanding the issue above helps when designing multi-threaded Haskell/C library interaction.
Recommended answer
I believe the key question here is, how does the GHC runtime schedule C callbacks into Haskell? Although I don't know for certain, my suspicion is that all C callbacks are handled by the Haskell thread that originally made the foreign call, at least up to ghc-7.2.1 (which I'm using).
This would explain the large slowdown you (and I) see when moving from 1 thread to 5. If the five threads are all calling back into the same Haskell thread, there will be significant contention on that Haskell thread to complete all the callbacks.
In order to test this, I modified your code so that Haskell forks a new thread before calling create_threads, and create_threads only spawns one thread per call. If I'm correct, each OS thread will have a dedicated Haskell thread to perform work, so there should be much less contention. Although this still takes almost twice as long as the single-thread version, it's significantly faster than the original multi-threaded version, which lends some evidence to this theory. The difference is much less if I turn off thread migration with +RTS -qm.
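The modified code is not shown in the answer, so the following is only a guess at what the C side of that experiment might look like. create_one_thread, oneThreadArgs, and the stand-in callback are hypothetical names introduced here; the idea is that each forked Haskell thread makes its own blocking call, and each call spawns exactly one pthread.

```c
#include <assert.h>
#include <pthread.h>

typedef void (*FunctionPtr)(int);

typedef struct {
    FunctionPtr fn;
    int length;
} oneThreadArgs;

static void *one_thread_func(void *arg)
{
    oneThreadArgs *a = arg;
    for (int i = 0; i < a->length; i++)
        a->fn(i);   /* in the real test this re-enters Haskell */
    return NULL;
}

/* Hypothetical variant of create_threads: spawns exactly one pthread,
   waits for it to finish, and returns 0 on success or -1 on failure. */
int create_one_thread(FunctionPtr fn, int length)
{
    assert(fn != NULL && length >= 0);
    pthread_t pth;
    oneThreadArgs args = { fn, length };
    if (pthread_create(&pth, NULL, one_thread_func, &args) != 0)
        return -1;
    return pthread_join(pth, NULL) == 0 ? 0 : -1;
}

/* C stand-in callback, only for trying the helper without Haskell. */
static long stand_in_calls = 0;
static void stand_in_callback(int i)
{
    (void)i;
    stand_in_calls++;
}
```

On the Haskell side, such a function would presumably be brought in with a safe foreign import and called once from each forked Haskell thread, with the main thread waiting for all of them to finish before freeing the function pointers.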
As Daniel Fischer reports different results for ghc-7.2.2, I would expect that version changes how Haskell schedules callbacks. Maybe somebody on the ghc-users list can provide more information on this; I don't see anything likely in the release notes for 7.2.2 or 7.4.1.