Specialization with Constraints

Problem description

I'm having problems getting GHC to specialize a function with a class constraint. I have a minimal example of my problem here: Foo.hs and Main.hs. The two files compile (GHC 7.6.2, ghc -O3 Main) and run.

NOTE: Foo.hs is really stripped down. If you want to see why the constraint is needed, you can see a little more code here. If I put the code in a single file or make many other minor changes, GHC simply inlines the call to plusFastCyc. This will not happen in the real code because plusFastCyc is too large for GHC to inline, even when marked INLINE. The point is to specialize the call to plusFastCyc, not inline it. plusFastCyc is called in many places in the real code, so duplicating such a large function would not be desirable even if I could force GHC to do it.

The code of interest is the plusFastCyc in Foo.hs, reproduced here:

{-# INLINEABLE plusFastCyc #-}
{-# SPECIALIZE plusFastCyc :: 
         forall m . (Factored m Int) => 
              (FastCyc (VT U.Vector m) Int) -> 
                   (FastCyc (VT U.Vector m) Int) -> 
                        (FastCyc (VT U.Vector m) Int) #-}

-- Although the next specialization makes `fcTest` fast,
-- it isn't useful to me in my real program because the phantom type M is reified
-- {-# SPECIALIZE plusFastCyc :: 
--          FastCyc (VT U.Vector M) Int -> 
--               FastCyc (VT U.Vector M) Int -> 
--                    FastCyc (VT U.Vector M) Int #-}

plusFastCyc :: (Num (t r)) => (FastCyc t r) -> (FastCyc t r) -> (FastCyc t r)
plusFastCyc (PowBasis v1) (PowBasis v2) = PowBasis $ v1 + v2
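
For readers without access to Foo.hs, the shape of the pragma can be reproduced in a single file. The following is only a sketch under stated assumptions: Factored, M, VT and FastCyc here are hypothetical, list-backed stand-ins (the real VT takes a vector type and the real Factored presumably has methods), and in a one-file program GHC may simply inline the call, so this shows the syntax rather than the slowdown.

```haskell
{-# LANGUAGE FlexibleContexts      #-}
{-# LANGUAGE MultiParamTypeClasses #-}
module Main (main) where

-- Hypothetical stand-ins for the definitions in Foo.hs.
class Factored m r          -- constraint on the phantom index; no methods here

data M                      -- one concrete phantom index
instance Factored M Int

-- List-backed wrapper; m is a phantom type parameter.
newtype VT m r = VT [r] deriving (Eq, Show)

instance (Num r, Factored m r) => Num (VT m r) where
  VT x + VT y   = VT (zipWith (+) x y)
  VT x * VT y   = VT (zipWith (*) x y)
  negate (VT x) = VT (map negate x)
  abs (VT x)    = VT (map abs x)
  signum (VT x) = VT (map signum x)
  fromInteger n = VT (repeat (fromInteger n))

newtype FastCyc t r = PowBasis (t r)

unPow :: FastCyc t r -> t r
unPow (PowBasis v) = v

-- Specialize the element type to Int while leaving m polymorphic,
-- so the pragma itself carries the Factored m Int constraint.
{-# INLINABLE plusFastCyc #-}
{-# SPECIALIZE plusFastCyc
      :: Factored m Int
      => FastCyc (VT m) Int -> FastCyc (VT m) Int -> FastCyc (VT m) Int #-}
plusFastCyc :: Num (t r) => FastCyc t r -> FastCyc t r -> FastCyc t r
plusFastCyc (PowBasis v1) (PowBasis v2) = PowBasis (v1 + v2)

main :: IO ()
main = do
  let a = PowBasis (VT [1, 2, 3]) :: FastCyc (VT M) Int
      b = PowBasis (VT [10, 20, 30])
  print (unPow (plusFastCyc a b))
```

The specialized type still mentions a class constraint, which is the situation the question is about: the generated rule has to match a call whose Num dictionary is built from a Factored dictionary.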

The Main.hs file has two drivers: vtTest, which runs in ~3 seconds, and fcTest, which runs in ~83 seconds when compiled with -O3 using the forall'd specialization.

The core shows that for the vtTest test, the addition code is being specialized to Unboxed vectors over Ints, etc, while generic vector code is used for fcTest. On line 10, you can see that GHC does write a specialized version of plusFastCyc, compared to the generic version on line 167. The rule for the specialization is on line 225. I believe this rule should fire on line 270. (main6 calls iterate main8 y, so main8 is where plusFastCyc should be specialized.)
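
To check whether a specialization rule exists and fires, GHC's dump flags can be switched on per module. The sketch below uses a hypothetical function, sumSquares, rather than anything from Foo.hs. -ddump-spec prints the specialiser's output and -ddump-rule-firings logs each rule that fires; adding -ddump-simpl would also print the final Core. Rules generated by SPECIALIZE are usually reported with names beginning "SPEC", though the exact output depends on the GHC version.

```haskell
{-# OPTIONS_GHC -O2 -ddump-rule-firings -ddump-spec #-}
module Main (main) where

-- Hypothetical example function, not part of Foo.hs.
{-# SPECIALIZE sumSquares :: [Int] -> Int #-}
sumSquares :: Num a => [a] -> a
sumSquares = foldr (\x acc -> x * x + acc) 0

main :: IO ()
main = print (sumSquares [1, 2, 3 :: Int])
```

If the rule for a specialization is absent from the firing log, the call site is still using the generic, dictionary-passing version.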

My goal is to make fcTest as fast as vtTest by specializing plusFastCyc. I've found two ways to do this:

  1. Explicitly call inline from GHC.Exts in fcTest.
  2. Remove the Factored m Int constraint on plusFastCyc.
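
Option 1 might look roughly like the following sketch. Wrapped, plusWrapped and step are hypothetical names standing in for FastCyc, plusFastCyc and the loop body of fcTest; only inline itself comes from GHC.Exts. inline asks GHC to inline the function at that one call site regardless of its size, which requires its unfolding to be available (hence the INLINABLE pragma).

```haskell
module Main (main) where

import GHC.Exts (inline)

-- Hypothetical wrapper standing in for FastCyc.
newtype Wrapped a = Wrapped a

unwrap :: Wrapped a -> a
unwrap (Wrapped x) = x

-- INLINABLE keeps the unfolding available, which 'inline' relies on.
{-# INLINABLE plusWrapped #-}
plusWrapped :: Num a => Wrapped a -> Wrapped a -> Wrapped a
plusWrapped (Wrapped x) (Wrapped y) = Wrapped (x + y)

-- Request inlining at this call site only.
step :: Wrapped Int -> Wrapped Int
step = inline plusWrapped (Wrapped 1)

main :: IO ()
main = print (unwrap (iterate step (Wrapped 0) !! 1000))  -- prints 1000
```

Every call site wrapped this way gets its own copy of the function body, which is the code duplication the question wants to avoid for a large function.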

Option 1 is unsatisfactory because in the actual code base plusFastCyc is a frequently used operation and a very large function, so it should not be inlined at every use. Rather, GHC should call a specialized version of plusFastCyc. Option 2 is not really an option because I need the constraint in the real code.

I've tried a variety of options using (and not using) INLINE, INLINABLE, and SPECIALIZE, but nothing seems to work. (EDIT: I may have stripped out too much of plusFastCyc to make my example small, so INLINE might cause the function to be inlined. This doesn't happen in my real code because plusFastCyc is so large.) In this particular example, I'm not getting any match_co: needs more cases or RULE: LHS too complicated to desugar (and here) warnings, though I was getting many match_co warnings before minimizing the example. Presumably, the "problem" is the Factored m Int constraint in the rule; if I make changes to that constraint, fcTest runs as fast as vtTest.

Am I doing something GHC just doesn't like? Why won't GHC specialize plusFastCyc, and how can I make it do so?

UPDATE

The problem persists in GHC 7.8.2, so this question is still relevant.

Solution

GHC also gives an option to SPECIALIZE a type-class instance declaration. I tried this with the (expanded) code of Foo.hs, by putting the following:

instance (Num r, V.Vector v r, Factored m r) => Num (VT v m r) where 
    {-# SPECIALIZE instance ( Factored m Int => Num (VT U.Vector m Int)) #-}
    VT x + VT y = VT $ V.zipWith (+) x y

This change, though, did not achieve the desired speedup. What did achieve that performance improvement was manually adding a specialized instance for the type VT U.Vector m Int with the same function definitions, as follows:

instance (Factored m Int) => Num (VT U.Vector m Int) where 
    VT x + VT y = VT $ V.zipWith (+) x y

This requires adding OverlappingInstances and FlexibleInstances to the LANGUAGE pragma.

Interestingly, in the example program, the speedup obtained with the overlapping instance remains even if you remove every SPECIALIZE and INLINABLE pragma.
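
On GHC 7.10 and later, the OverlappingInstances extension is deprecated in favour of per-instance OVERLAPPING and OVERLAPPABLE pragmas. A self-contained sketch of the hand-specialized instance in that style follows. It makes several assumptions: Factored and M are hypothetical stand-ins, ZipList from base replaces U.Vector so that no extra package is needed, and the general instance requires Applicative instead of V.Vector.

```haskell
{-# LANGUAGE FlexibleContexts      #-}
{-# LANGUAGE FlexibleInstances     #-}
{-# LANGUAGE MultiParamTypeClasses #-}
module Main (main) where

import Control.Applicative (ZipList (..), liftA2)

-- Hypothetical stand-ins for the definitions in Foo.hs.
class Factored m r
data M
instance Factored M Int

-- f plays the role of the vector type, m is the phantom index.
newtype VT f m r = VT (f r)

unVT :: VT f m r -> f r
unVT (VT x) = x

-- General instance: works for any element type and container.
instance (Num r, Applicative f, Factored m r) => Num (VT f m r) where
  VT x + VT y   = VT (liftA2 (+) x y)
  VT x * VT y   = VT (liftA2 (*) x y)
  negate (VT x) = VT (fmap negate x)
  abs (VT x)    = VT (fmap abs x)
  signum (VT x) = VT (fmap signum x)
  fromInteger   = VT . pure . fromInteger

-- Hand-written copy at concrete container and element types.
-- OVERLAPPING lets it take precedence wherever both instances match.
instance {-# OVERLAPPING #-} Factored m Int => Num (VT ZipList m Int) where
  VT x + VT y   = VT (liftA2 (+) x y)
  VT x * VT y   = VT (liftA2 (*) x y)
  negate (VT x) = VT (fmap negate x)
  abs (VT x)    = VT (fmap abs x)
  signum (VT x) = VT (fmap signum x)
  fromInteger   = VT . pure . fromInteger

main :: IO ()
main = do
  let a = VT (ZipList [1, 2, 3]) :: VT ZipList M Int
      b = VT (ZipList [10, 20, 30])
  print (getZipList (unVT (a + b)))  -- prints [11,22,33]
```

Because the method bodies in the second instance are type-checked at the concrete types, GHC compiles them against the Int and ZipList operations directly, with no rewrite rule needing to fire.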
