GCC compile error with >2 GB of code


Problem description


I have a huge number of functions totaling around 2.8 GB of object code (unfortunately there's no way around, scientific computing ...)

When I try to link them, I get (expected) relocation truncated to fit: R_X86_64_32S errors, which I hoped to circumvent by specifying the compiler flag -mcmodel=medium. All additionally linked libraries that I have control of are compiled with the -fpic flag.

Still, the error persists, and I assume that some libraries I link to are not compiled with PIC.

Here's the error:

/usr/lib/gcc/x86_64-redhat-linux/4.1.2/../../../../lib64/crt1.o: In function `_start':
(.text+0x12): relocation truncated to fit: R_X86_64_32S against symbol `__libc_csu_fini'     defined in .text section in /usr/lib64/libc_nonshared.a(elf-init.oS)
/usr/lib/gcc/x86_64-redhat-linux/4.1.2/../../../../lib64/crt1.o: In function `_start':
(.text+0x19): relocation truncated to fit: R_X86_64_32S against symbol `__libc_csu_init'    defined in .text section in /usr/lib64/libc_nonshared.a(elf-init.oS)
/usr/lib/gcc/x86_64-redhat-linux/4.1.2/../../../../lib64/crt1.o: In function `_start':
(.text+0x20): undefined reference to `main'
/usr/lib/gcc/x86_64-redhat-linux/4.1.2/../../../../lib64/crti.o: In function    `call_gmon_start':
(.text+0x7): relocation truncated to fit: R_X86_64_GOTPCREL against undefined symbol      `__gmon_start__'
/usr/lib/gcc/x86_64-redhat-linux/4.1.2/crtbegin.o: In function `__do_global_dtors_aux':
crtstuff.c:(.text+0xb): relocation truncated to fit: R_X86_64_PC32 against `.bss' 
crtstuff.c:(.text+0x13): relocation truncated to fit: R_X86_64_32 against symbol `__DTOR_END__' defined in .dtors section in /usr/lib/gcc/x86_64-redhat-linux/4.1.2/crtend.o
crtstuff.c:(.text+0x19): relocation truncated to fit: R_X86_64_32S against `.dtors'
crtstuff.c:(.text+0x28): relocation truncated to fit: R_X86_64_PC32 against `.bss'
crtstuff.c:(.text+0x38): relocation truncated to fit: R_X86_64_PC32 against `.bss'
crtstuff.c:(.text+0x3f): relocation truncated to fit: R_X86_64_32S against `.dtors'
crtstuff.c:(.text+0x46): relocation truncated to fit: R_X86_64_PC32 against `.bss'
crtstuff.c:(.text+0x51): additional relocation overflows omitted from the output
collect2: ld returned 1 exit status
make: *** [testsme] Error 1

And system libraries I link against:

-lgfortran -lm -lrt -lpthread

Any clues where to look for the problem?

EDIT: First of all, thank you for the discussion ... To clarify a bit, I have hundreds of functions (each approx 1 MB in size in separate object files) like this:

double func1(std::tr1::unordered_map<int, double> & csc, 
             std::vector<EvaluationNode::Ptr> & ti, 
             ProcessVars & s)
{
    double sum, prefactor, expr;

    prefactor = +s.ds8*s.ds10*ti[0]->value();
    expr =       ( - 5/243.*(s.x14*s.x15*csc[49300] + 9/10.*s.x14*s.x15*csc[49301] +
           1/10.*s.x14*s.x15*csc[49302] - 3/5.*s.x14*s.x15*csc[49303] -
           27/10.*s.x14*s.x15*csc[49304] + 12/5.*s.x14*s.x15*csc[49305] -
           3/10.*s.x14*s.x15*csc[49306] - 4/5.*s.x14*s.x15*csc[49307] +
           21/10.*s.x14*s.x15*csc[49308] + 1/10.*s.x14*s.x15*csc[49309] -
           s.x14*s.x15*csc[51370] - 9/10.*s.x14*s.x15*csc[51371] -
           1/10.*s.x14*s.x15*csc[51372] + 3/5.*s.x14*s.x15*csc[51373] +
           27/10.*s.x14*s.x15*csc[51374] - 12/5.*s.x14*s.x15*csc[51375] +
           3/10.*s.x14*s.x15*csc[51376] + 4/5.*s.x14*s.x15*csc[51377] -
           21/10.*s.x14*s.x15*csc[51378] - 1/10.*s.x14*s.x15*csc[51379] -
           2*s.x14*s.x15*csc[55100] - 9/5.*s.x14*s.x15*csc[55101] -
           1/5.*s.x14*s.x15*csc[55102] + 6/5.*s.x14*s.x15*csc[55103] +
           27/5.*s.x14*s.x15*csc[55104] - 24/5.*s.x14*s.x15*csc[55105] +
           3/5.*s.x14*s.x15*csc[55106] + 8/5.*s.x14*s.x15*csc[55107] -
           21/5.*s.x14*s.x15*csc[55108] - 1/5.*s.x14*s.x15*csc[55109] -
           2*s.x14*s.x15*csc[55170] - 9/5.*s.x14*s.x15*csc[55171] -
           1/5.*s.x14*s.x15*csc[55172] + 6/5.*s.x14*s.x15*csc[55173] +
           27/5.*s.x14*s.x15*csc[55174] - 24/5.*s.x14*s.x15*csc[55175] +
           // ...
           ;

        sum += prefactor*expr;
    // ...
    return sum;
}

The object s is relatively small and keeps the needed constants x14, x15, ..., ds0, ..., etc., while ti just returns a double from an external library. As you can see, csc[] is a precomputed map of values which is also evaluated in separate object files (again hundreds, each about 1 MB in size) of the following form:

void cscs132(std::tr1::unordered_map<int,double> & csc, ProcessVars & s)
{
    {
    double csc19295 =       + s.ds0*s.ds1*s.ds2 * ( -
           32*s.x12pow2*s.x15*s.x34*s.mbpow2*s.mWpowinv2 -
           32*s.x12pow2*s.x15*s.x35*s.mbpow2*s.mWpowinv2 -
           32*s.x12pow2*s.x15*s.x35*s.x45*s.mWpowinv2 -
           32*s.x12pow2*s.x25*s.x34*s.mbpow2*s.mWpowinv2 -
           32*s.x12pow2*s.x25*s.x35*s.mbpow2*s.mWpowinv2 -
           32*s.x12pow2*s.x25*s.x35*s.x45*s.mWpowinv2 +
           32*s.x12pow2*s.x34*s.mbpow4*s.mWpowinv2 +
           32*s.x12pow2*s.x34*s.x35*s.mbpow2*s.mWpowinv2 +
           32*s.x12pow2*s.x34*s.x45*s.mbpow2*s.mWpowinv2 +
           32*s.x12pow2*s.x35*s.mbpow4*s.mWpowinv2 +
           32*s.x12pow2*s.x35pow2*s.mbpow2*s.mWpowinv2 +
           32*s.x12pow2*s.x35pow2*s.x45*s.mWpowinv2 +
           64*s.x12pow2*s.x35*s.x45*s.mbpow2*s.mWpowinv2 +
           32*s.x12pow2*s.x35*s.x45pow2*s.mWpowinv2 -
           64*s.x12*s.p1p3*s.x15*s.mbpow4*s.mWpowinv2 +
           64*s.x12*s.p1p3*s.x15pow2*s.mbpow2*s.mWpowinv2 +
           96*s.x12*s.p1p3*s.x15*s.x25*s.mbpow2*s.mWpowinv2 -
           64*s.x12*s.p1p3*s.x15*s.x35*s.mbpow2*s.mWpowinv2 -
           64*s.x12*s.p1p3*s.x15*s.x45*s.mbpow2*s.mWpowinv2 -
           32*s.x12*s.p1p3*s.x25*s.mbpow4*s.mWpowinv2 +
           32*s.x12*s.p1p3*s.x25pow2*s.mbpow2*s.mWpowinv2 -
           32*s.x12*s.p1p3*s.x25*s.x35*s.mbpow2*s.mWpowinv2 -
           32*s.x12*s.p1p3*s.x25*s.x45*s.mbpow2*s.mWpowinv2 -
           32*s.x12*s.p1p3*s.x45*s.mbpow2 +
           64*s.x12*s.x14*s.x15pow2*s.x35*s.mWpowinv2 +
           96*s.x12*s.x14*s.x15*s.x25*s.x35*s.mWpowinv2 +
           32*s.x12*s.x14*s.x15*s.x34*s.mbpow2*s.mWpowinv2 -
           32*s.x12*s.x14*s.x15*s.x35*s.mbpow2*s.mWpowinv2 -
           64*s.x12*s.x14*s.x15*s.x35pow2*s.mWpowinv2 -
           32*s.x12*s.x14*s.x15*s.x35*s.x45*s.mWpowinv2 +
           32*s.x12*s.x14*s.x25pow2*s.x35*s.mWpowinv2 +
           32*s.x12*s.x14*s.x25*s.x34*s.mbpow2*s.mWpowinv2 -
           32*s.x12*s.x14*s.x25*s.x35pow2*s.mWpowinv2 -
           // ...

       csc.insert(cscMap::value_type(192953, csc19295));
    }

    {
       double csc19296 =      // ... ;

       csc.insert(cscMap::value_type(192956, csc19296));
    }

    // ...
}

That's about it. The final step then just consists in calling all those func[i] and summing the result up.

Concerning the fact that this is a rather special and unusual case: Yes, it is. This is what people have to cope with when trying to do high precision computations for particle physics.

EDIT2: I should also add that x12, x13, etc. are not really constants. They are set to specific values, all those functions are run and the result returned, and then a new set of x12, x13, etc. is chosen to produce the next value. And this has to be done 10^5 to 10^6 times...

EDIT3: Thank you for the suggestions and the discussion so far... I'll try to roll the loops up upon code generation somehow. To be honest, I'm not sure how to do this exactly, but this is the best bet.

BTW, I didn't try to hide behind "this is scientific computing -- no way to optimize". It's just that the basis for this code comes out of a "black box" that I have no real access to and, moreover, the whole thing worked great with simple examples, and I mainly feel overwhelmed by what happens in a real-world application ...

EDIT4: So, I have managed to reduce the code size of the csc definitions by about one fourth by simplifying expressions in a computer algebra system (Mathematica). I now also see some way to reduce it by another order of magnitude or so by applying some other tricks before generating the code (which would bring this part down to about 100 MB), and I hope this idea works.

Now related to your answers: I'm trying to roll the loops back up again in the funcs, where a CAS won't help much, but I already have some ideas. For instance, sorting the expressions by variables like x12, x13, ..., parsing the cscs with Python and generating tables that relate them to each other. Then I can at least generate these parts as loops. As this seems to be the best solution so far, I mark this as the best answer.
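A minimal sketch of what one such generated table and loop might look like, using the first four terms of func1 shown above. It relies on the visible regularity that every term is a rational coefficient times s.x14*s.x15 times one csc entry. The names CscTerm, kFunc1Terms and weighted_csc_sum are hypothetical, and std::unordered_map (C++11) stands in for the std::tr1 version used in the question.

```cpp
#include <cassert>
#include <unordered_map>
#include <vector>

// Hypothetical table row: one term "coeff * csc[key]" of the inner sum.
struct CscTerm {
    double coeff;
    int key;
};

// First four terms of the inner sum of func1; a generator would emit the
// full table as data instead of emitting one long expression as code.
const std::vector<CscTerm> kFunc1Terms = {
    {1.0, 49300},
    {9.0 / 10.0, 49301},
    {1.0 / 10.0, 49302},
    {-3.0 / 5.0, 49303},
};

// Computes common * sum_i(coeff_i * csc[key_i]). The factor shared by all
// terms (s.x14 * s.x15 in func1) is passed in once instead of being
// repeated in every term.
double weighted_csc_sum(const std::vector<CscTerm>& terms,
                        const std::unordered_map<int, double>& csc,
                        double common) {
    double sum = 0.0;
    for (const CscTerm& t : terms) {
        auto it = csc.find(t.key);
        assert(it != csc.end());  // assumption: every key was precomputed
        sum += t.coeff * it->second;
    }
    return common * sum;
}
```

In func1 the call would be something like weighted_csc_sum(kFunc1Terms, csc, s.x14 * s.x15), multiplied by the outer factor -5/243. and by prefactor. Unlike csc[key], the lookup here does not insert a zero entry for a missing key.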

However, I'd also like to give credit to VJo. GCC 4.6 indeed works much better, produces smaller code and is faster. Using the large model works with the code as-is. So technically this is the correct answer, but changing the whole concept is a much better approach.

Thank you all for your suggestions and help. If anyone is interested, I'm going to post the final outcome as soon as I am ready.

REMARKS: Just some remarks on some of the other answers: The code I'm trying to run does not originate in an expansion of simple functions/algorithms and stupid unnecessary unrolling. What actually happens is that the stuff we start with are pretty complicated mathematical objects, and bringing them to a numerically computable form generates these expressions. The problem actually lies in the underlying physical theory. The complexity of intermediate expressions scales factorially, which is well known, but when combining all of this stuff into something physically measurable -- an observable -- it just boils down to only a handful of very small functions that form the basis of the expressions. (There is definitely something "wrong" in this respect with the general and only available ansatz, which is called "perturbation theory".) We try to bring this ansatz to another level, which is not feasible analytically anymore and where the basis of needed functions is not known. So we try to brute-force it like this. Not the best way, but hopefully one that helps with our understanding of the physics at hand in the end...

LAST EDIT: Thanks to all your suggestions, I've managed to reduce the code size considerably, using Mathematica and a modification of the code generator for the funcs somewhat along the lines of the top answer :)

I have simplified the csc functions with Mathematica, bringing them down to 92 MB. This is the irreducible part. The first attempts took forever, but after some optimizations this now runs through in about 10 minutes on a single CPU.

The effect on the funcs was dramatic: The whole code size for them is down to approximately 9 MB, so the code now totals in the 100 MB range. Now it makes sense to turn optimizations on and the execution is quite fast.

Again, thank you all for your suggestions, I've learned a lot.

Solution

So, you already have a program that produces this text:

prefactor = +s.ds8*s.ds10*ti[0]->value();
expr = ( - 5/243.*(s.x14*s.x15*csc[49300] + 9/10.*s.x14*s.x15*csc[49301] +
       1/10.*s.x14*s.x15*csc[49302] - 3/5.*s.x14*s.x15*csc[49303] -...

and

double csc19295 =       + s.ds0*s.ds1*s.ds2 * ( -
       32*s.x12pow2*s.x15*s.x34*s.mbpow2*s.mWpowinv2 -
       32*s.x12pow2*s.x15*s.x35*s.mbpow2*s.mWpowinv2 -
       32*s.x12pow2*s.x15*s.x35*s.x45*s.mWpowinv2 -...

right?

If all your functions have a similar "format" (multiply n numbers m times and add the results - or something similar) then I think you can do this:

  • change the generator program to output offsets instead of strings (i.e. instead of the string "s.ds0" it will produce offsetof(ProcessVars, ds0))
  • create an array of such offsets
  • write an evaluator which accepts the array above and the base addresses of the structure pointers and produces a result

The array+evaluator will represent the same logic as one of your functions, but only the evaluator will be code. The array is "data" and can be either generated at runtime or saved on disk and read in chunks or with a memory-mapped file.

For your particular example in func1, imagine how you would rewrite the function via an evaluator if you had access to the base addresses of s and csc, and also a vector-like representation of the constants and the offsets you need to add to the base addresses to get to x14, ds8 and csc[51370].
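The idea can be sketched roughly as follows, here for the first two terms inside the parentheses of csc19295 rather than for func1. This is only an illustration under stated assumptions: the ProcessVars below is a cut-down, hypothetical stand-in for the real struct (it must be standard-layout for offsetof to be valid), and Term, field, evaluate and first_two_terms_of_csc19295 are made-up names. It needs C++11.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical, cut-down stand-in for the real ProcessVars.
struct ProcessVars {
    double ds0, ds1, ds2;
    double x12pow2, x15, x34, x35, x45, mbpow2, mWpowinv2;
};

// One term of a sum: a coefficient times the product of the struct
// fields found at the listed byte offsets.
struct Term {
    double coeff;
    std::vector<std::size_t> offsets;
};

// Reads the double stored `offset` bytes from the start of s.
inline double field(const ProcessVars& s, std::size_t offset) {
    assert(offset + sizeof(double) <= sizeof(ProcessVars));
    double value;
    std::memcpy(&value, reinterpret_cast<const char*>(&s) + offset, sizeof value);
    return value;
}

// The evaluator: the only real code. Everything else is data.
double evaluate(const std::vector<Term>& terms, const ProcessVars& s) {
    double sum = 0.0;
    for (const Term& t : terms) {
        double product = t.coeff;
        for (std::size_t off : t.offsets) {
            product *= field(s, off);
        }
        sum += product;
    }
    return sum;
}

// What the generator would emit for the first two terms of csc19295.
// In practice this table could live in a file instead of in the binary.
std::vector<Term> first_two_terms_of_csc19295() {
    return {
        Term{-32.0, {offsetof(ProcessVars, x12pow2), offsetof(ProcessVars, x15),
                     offsetof(ProcessVars, x34), offsetof(ProcessVars, mbpow2),
                     offsetof(ProcessVars, mWpowinv2)}},
        Term{-32.0, {offsetof(ProcessVars, x12pow2), offsetof(ProcessVars, x15),
                     offsetof(ProcessVars, x35), offsetof(ProcessVars, mbpow2),
                     offsetof(ProcessVars, mWpowinv2)}},
    };
}
```

The prefactor s.ds0*s.ds1*s.ds2 would be applied to the result of evaluate() in the same way. Pointers to members (double ProcessVars::*) would be a type-safe alternative to raw offsets if the table is built inside the program rather than loaded from disk.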

You need to create a new form of "data" that will describe how to process the actual data you pass to your huge number of functions.
