OpenMP with MSVC 2010 Debug build: strange bug when objects are copied


Problem description

I have a fairly complex program that runs into strange behavior when built with OpenMP in MSVC 2010 Debug mode. I have tried my best to construct the following minimal working example (though it is not really minimal), which mimics the structure of the real program.

#include <vector>
#include <cassert>

// A class that takes a pointer to the whole collection and a position. It only
// allows access to the element at that position, and provides read-only access
// to query some information about the whole collection
class Element
{
    public :

    Element (int i, std::vector<double> *src) : i_(i), src_(src) {}

    int i () const {return i_;}
    int size () const {return src_->size();}

    double src () const {return (*src_)[i_];}
    double &src () {return (*src_)[i_];}

    private :

    const int i_;
    std::vector<double> *const src_;
};

// A Base class for dispatch
template <typename Derived>
class Base
{
    protected :

    void eval (int dim, Element elem, double *res)
    {
        // Dispatch the call from Evaluator<Derived>
        eval_dispatch(dim, elem, res, &Derived::eval); // Point (2)
    }

    private :

    // Resolve to Derived non-static member eval(...)
    template <typename D>
    void eval_dispatch(int dim, Element elem, double *res,
            void (D::*) (int, Element, double *))
    {
#ifndef NDEBUG // Assert that this is a Derived object
        assert((dynamic_cast<Derived *>(this)));
#endif
        static_cast<Derived *>(this)->eval(dim, elem, res);
    }

    // Resolve to Derived static member eval(...)
    void eval_dispatch(int dim, Element elem, double *res,
            void (*) (int, Element, double *))
    {
        Derived::eval(dim, elem, res); // Point (3)
    }

    // Resolve to Base member eval(...): Derived has no such member of its own
    // but is derived from Base
    void eval_dispatch(int dim, Element elem, double *res,
            void (Base::*) (int, Element, double *))
    {
        // Default behavior: do nothing
    }
};

// A middle-man that provides the interface operator(), calls Base::eval, and
// lets Base dispatch it to a possible default behavior or to Derived::eval
template <typename Derived>
class Evaluator : public Base<Derived>
{
    public :

    void operator() (int N , int dim, double *res)
    {
        std::vector<double> src(N);
        for (int i = 0; i < N; ++i)
            src[i] = i;

#pragma omp parallel for default(none) shared(N, dim, src, res)
        for (int i = 0; i < N; ++i) {
            assert(i < N);
            double *r = res + i * dim;
            Element elem(i, &src);
            assert(elem.i() == i); // Point (1)
            this->eval(dim, elem, r);
        }
    }
};

// Client code, which implements eval
class Implementation : public Evaluator<Implementation>
{
    public :

    static void eval (int dim, Element elem, double *r)
    {
        assert(elem.i() < elem.size()); // This is where the program fails Point (4)
        for (int d = 0; d != dim; ++d)
            r[d] = elem.src();
    }
};

int main ()
{
    const int N = 500000;
    const int Dim = 2;
    double *res = new double[N * Dim];
    Implementation impl;
    impl(N, Dim, res);
    delete [] res;

    return 0;
}

The real program does not have vector etc., but Element, Base, Evaluator and Implementation capture the basic structure of the real program. When built in Debug mode and run under the debugger, the assertion fails at Point (4).

Here are some more details from the debugger, obtained by viewing the call stack:

On entering Point (1), the local i has value 371152, which is fine. The variable elem does not show up in the frame, which is a little strange. But since the assertion at Point (1) does not fail, I guess it is fine.

Then crazy things happen. The call to eval by Evaluator resolves to its base class, and so Point (2) is executed. At this point the debugger shows that elem has i_ = 499999, which is no longer the i used to create elem in Evaluator before passing it by value to Base::eval. Next, the call resolves to Point (3); this time elem has i_ = 501682, which is out of range, and this is the value when the call is directed to Point (4) and fails the assertion.

It looks like whenever an Element object is passed by value, the values of its members change. When I rerun the program multiple times, similar behavior occurs, though it is not always reproducible. In the real program, this class is designed to act like an iterator over a collection of particles, although the thing it iterates over is not exactly a container. Anyway, the point is that it is small enough to be passed by value efficiently. The client code therefore knows that it has its own copy of Element instead of some reference or pointer, and does not need to worry (much) about thread safety as long as it sticks to Element's interface, which only provides write access to a single position of the whole collection.

I tried the same program with GCC and Intel ICPC. Nothing unexpected happens, and in the real program correct results were produced.

Did I use OpenMP wrongly somewhere? I thought that the elem created at about Point (1) should be local to the loop body. In addition, nowhere in the whole program is a value bigger than N produced, so where do those new values come from?

Edit

I looked more carefully into the debugger. It shows that while elem.i_ changes when elem is passed by value, the pointer elem.src_ does not change with it. It has the same value (the same memory address) after being passed by value.

Edit: Compiler flags

I used CMake to generate the MSVC solution. I have to confess I have no idea how to use MSVC or Windows in general. The only reason I am using it is that I know a lot of people use it, so I want to test my library against it and work around any problems.

The CMake-generated project uses the Visual Studio 10 Win64 target. The compiler flags appear to be

/DWIN32 /D_WINDOWS /W3 /Zm1000 /EHsc /GR /D_DEBUG /MDd /Zi /Ob0 /Od /RTC1

And here is the command line found in Property Pages - C/C++ - Command Line:

/Zi /nologo /W3 /WX- /Od /Ob0 /D "WIN32" /D "_WINDOWS" /D "_DEBUG" /D "CMAKE_INTDIR=\"Debug\"" /D "_MBCS" /Gm- /EHsc /RTC1 /MDd /GS /fp:precise /Zc:wchar_t /Zc:forScope /GR /openmp /Fp"TestOMP.dir\Debug\TestOMP.pch" /Fa"Debug" /Fo"TestOMP.dir\Debug\" /Fd"C:/Users/Yan Zhou/Dropbox/Build/TestOMP/build/Debug/TestOMP.pdb" /Gd /TP /errorReport:queue

Is there anything suspicious here?

Solution

Apparently the 64-bit OpenMP implementation in MSVC is not compatible with code compiled without optimisations.

To debug your issue, I've modified your code to save the iteration number to a threadprivate global variable just before the call to this->eval() and then added a check at the beginning of Implementation::eval() to see if the saved iteration number differs from elem.i_:

#include <cstdio> // printf
#include <omp.h>  // omp_get_thread_num

static int _iter;
#pragma omp threadprivate(_iter)

...
#pragma omp parallel for default(none) shared(N, dim, src, res)
    for (int i = 0; i < N; ++i) {
        assert(i < N);
        double *r = res + i * dim;
        Element elem(i, &src);
        assert(elem.i() == i); // Point (1)
        _iter = i;             // Save the iteration number
        this->eval(dim, elem, r);
    }
}
...

...
static void eval (int dim, Element elem, double *r)
{
    // Check for difference
    if (elem.i() != _iter)
        printf("[%d] _iter=%x != %x\n", omp_get_thread_num(), _iter, elem.i());
    assert(elem.i() < elem.size()); // This is where the program fails Point (4)
    for (int d = 0; d != dim; ++d)
        r[d] = elem.src();
}
...

It appears that the value of elem.i_ randomly becomes a bad mixture of the values passed in the different threads to void eval_dispatch(int dim, Element elem, double *res, void (*) (int, Element, double *)). This happens hundreds of times in each run, but you only see it once the value of elem.i_ becomes large enough to trigger the assertion. Sometimes the mixed value does not exceed the size of the container, and then the code completes execution without an assertion failure. Also, what you see during the debug session after the assertion is the inability of the VS debugger to cope properly with multithreaded code :)

This only happens in unoptimised 64-bit mode. It does not happen in 32-bit code (both debug and release). It also does not happen in 64-bit release code unless optimisations are disabled. It also does not happen if one puts the call to this->eval() in a critical section:

#pragma omp parallel for default(none) shared(N, dim, src, res)
    for (int i = 0; i < N; ++i) {
        ...
#pragma omp critical
        this->eval(dim, elem, r);
    }
}

but doing this would cancel the benefits of OpenMP. This shows that something further down the call chain is performed in an unsafe way. I've examined the assembly code but couldn't find the exact reason. I'm really puzzled, since MSVC implements the implicit copy constructor of the Element class using a simple bitwise copy (it is even inlined) and all operations are done on the stack.

This reminds me of the fact that Sun's (now Oracle's) compiler insists on cranking up the level of optimisation if one enables OpenMP support. Unfortunately the documentation of the /openmp option in MSDN says nothing about possible interference that might come from the "wrong" optimisation level. This also might be a bug. I should test with another version of VS if I can access one.

Edit: I dug deeper as promised and ran the code in Intel Parallel Inspector 2011. It found one data race pattern, as expected. Apparently when this line is executed:

this->eval(dim, elem, r);

a temporary copy of elem is created and passed by address to the eval() method, as required by the Windows x64 ABI. And here comes the strange thing: the location of this temporary copy is not on the stack of the funclet that implements the parallel region (the MSVC compiler calls it Evaluator$omp$1<Implementation>::operator(), by the way), as one would expect. Instead, its address is taken as the first argument of the funclet. As this argument is one and the same in all threads, the temporary copy that gets passed on to this->eval() is actually shared among all threads, which is ridiculous but still true, as one can easily observe:

...
void eval (int dim, Element elem, double *res)
{
    printf("[%d] In Base::eval()    &elem = %p\n", omp_get_thread_num(), &elem);
    // Dispatch the call from Evaluator<Derived>
    eval_dispatch(dim, elem, res, &Derived::eval); // Point (2)
}
...

...
#pragma omp parallel for default(none) shared(N, dim, src, res)
    for (int i = 0; i < N; ++i) {
        ...
        Element elem(i, &src);
        ...
        printf("[%d] In parallel region &elem = %p\n", omp_get_thread_num(), &elem);
        this->eval(dim, elem, r);
    }
}
...

Running this code produces an output similar to this:

[0] Parallel region &elem = 000000000030F348 (a)
[0] Base::eval()    &elem = 000000000030F630
[0] Parallel region &elem = 000000000030F348 (a)
[0] Base::eval()    &elem = 000000000030F630
[1] Parallel region &elem = 000000000292F9B8 (b)
[1] Base::eval()    &elem = 000000000030F630 <---- !!
[1] Parallel region &elem = 000000000292F9B8 (b)
[1] Base::eval()    &elem = 000000000030F630 <---- !!

As expected, elem has a different address in each thread executing the parallel region (points (a) and (b)). But observe that the temporary copy that gets passed to Base::eval() has the same address in each thread. I believe that this is a compiler bug that makes the implicit copy constructor of Element use a shared variable. This can be easily validated by looking at the address passed to Base::eval(): it lies somewhere between the address of N and that of src, i.e. in the shared variables block. Further inspection of the assembly source reveals that the address of the temporary location is indeed passed as an argument to the _vcomp_fork() function from vcomp100.dll, which implements the fork part of the OpenMP fork/join model.

Basically no compiler option influences this behaviour, apart from enabling optimisations, which leads to Base::eval(), Base::eval_dispatch(), and Implementation::eval() all being inlined, so that no temporary copies of elem are ever made. The only work-arounds that I have found are:
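To make the effect of a shared staging slot concrete, here is a minimal, single-threaded sketch. It is only a model of the behaviour described above, not what MSVC actually emits: Elem, callee, read_via_shared_slot and read_via_private_slot are hypothetical names, and the two "threads" are simulated by ordering the writes by hand so that the outcome is deterministic.

```cpp
#include <cassert>

// Hypothetical stand-in for the question's Element (index only).
struct Elem {
    int i;
};

// The callee as the x64 ABI sees it: it receives the address of a copy.
inline int callee(const Elem *copy) { return copy->i; }

// Model of the faulty lowering: both "threads" stage their argument copy
// in the same slot before the callee reads it.
inline int read_via_shared_slot(int mine, int other)
{
    Elem shared_slot = {0};
    shared_slot.i = mine;        // thread A stages its copy
    shared_slot.i = other;       // thread B overwrites the slot before A's call
    return callee(&shared_slot); // thread A's callee now reads B's value
}

// Model of the expected lowering: each thread stages its copy in its own
// stack slot, so the other thread's write cannot reach it.
inline int read_via_private_slot(int mine, int other)
{
    Elem slot_a = {mine};
    Elem slot_b = {other};
    (void) slot_b;               // B's copy exists but is never seen by A
    return callee(&slot_a);
}
```

Calling read_via_shared_slot(371152, 499999) returns the other thread's index, while read_via_private_slot(371152, 499999) returns the caller's own. In the real program the interleaving is up to the scheduler, which would explain why the failure is not always reproducible.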

1) Make the Element elem argument to Base::eval() a reference:

void eval (int dim, Element& elem, double *res)
{
    eval_dispatch(dim, elem, res, &Derived::eval); // Point (2)
}

This ensures that the local copy of elem in the stack of the funclet that implements the parallel region in Evaluator<Implementation>::operator() is passed and not the shared temporary copy. This gets further passed by value as another temporary copy to Base::eval_dispatch() but it retains its correct value as this new temporary copy is in the stack of Base::eval() and not in the shared variables block.

2) Provide an explicit copy constructor to Element:

Element (const Element& e) : i_(e.i_), src_(e.src_) {}

I would recommend that you go with the explicit copy constructor as it does not require further changes in the source code.
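As a rough way to check the work-around, the sketch below counts mismatching indices instead of asserting on the first one. It is a trimmed-down Element with the explicit copy constructor; inner, outer and count_mismatches are hypothetical helpers that I made up to mimic the two by-value hops (Base::eval and eval_dispatch). The pragma is guarded so the code also builds without OpenMP.

```cpp
#include <cassert>
#include <vector>

class Element
{
    public :

    Element (int i, std::vector<double> *src) : i_(i), src_(src) {}

    // Explicit copy constructor (work-around 2)
    Element (const Element &e) : i_(e.i_), src_(e.src_) {}

    int i () const {return i_;}
    int size () const {return static_cast<int>(src_->size());}

    private :

    const int i_;
    std::vector<double> *const src_;
};

// Hypothetical helpers: two by-value hops, like Base::eval -> eval_dispatch
inline int inner (Element elem) {return elem.i();}
inline int outer (Element elem) {return inner(elem);}

// Returns how many iterations saw an index different from the loop index
inline int count_mismatches (int n)
{
    std::vector<double> src(n);
    int mismatches = 0;
#ifdef _OPENMP
#pragma omp parallel for reduction(+ : mismatches)
#endif
    for (int i = 0; i < n; ++i) {
        Element elem(i, &src);
        if (outer(elem) != i)
            ++mismatches;
    }
    return mismatches;
}
```

With a correct compiler count_mismatches(N) should return 0 with or without the explicit copy constructor. On the affected MSVC configuration I would expect a non-zero count only when the copy constructor is removed, but treat that as something to verify rather than a guarantee.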

Apparently this behaviour is also present in MSVS 2008. I would have to check if it is also present in MSVS 2012 and possibly file a bug report with MS.

This bug does not show up in 32-bit code because there the whole value of each object passed by value is pushed onto the call stack, not just a pointer to it.
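For reference, the size rule behind this can be written down as a small predicate. This is my reading of the Windows x64 calling convention (aggregates of 1, 2, 4 or 8 bytes travel by value in a register or stack slot, everything else as a pointer to a caller-made copy), and passed_by_address_on_win64 is a hypothetical helper name. It ignores the other conditions the ABI places on class types, so treat it as a sketch only.

```cpp
#include <cassert>
#include <cstddef>

// Sketch of the size rule only: true if an aggregate of this size would be
// passed as a pointer to a caller-allocated copy on Windows x64.
inline bool passed_by_address_on_win64 (std::size_t size)
{
    return !(size == 1 || size == 2 || size == 4 || size == 8);
}
```

In a 64-bit build Element holds an int and a pointer, which comes to 16 bytes with padding, so passed_by_address_on_win64(sizeof(Element)) is true there and the hidden temporary copy comes into play.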
