How to force an error if non-finite values (NA, NaN, or Inf) are encountered


Problem description

There's a conditional debugging flag I miss from Matlab: dbstop if infnan described here. If set, this condition will stop code execution when an Inf or NaN is encountered (IIRC, Matlab doesn't have NAs).

How might I achieve this in R in a more efficient manner than testing all objects after every assignment operation?

At the moment, the only ways I see to do this are via hacks like the following:

  1. Manually insert a test after all places where these values might be encountered (e.g. a division, where division by 0 may occur). The testing would be to use is.finite(), described in this Q & A, on every element.
  2. Use body() to modify the code to call a separate function, after each operation or possibly just each assignment, which tests all of the objects (and possibly all objects in all environments).
  3. Modify R's source code (?!?)
  4. Attempt to use tracemem to identify those variables that have changed, and check only these for bad values.
  5. (New - see note 2) Use some kind of call handlers / callbacks to invoke a test function.

The 1st option is what I am doing at present. This is tedious, because I can't guarantee I've checked everything. The 2nd option will test everything, even if an object hasn't been updated. That is a massive waste of time. The 3rd option would involve modifying assignments of NA, NaN, and infinite values (+/- Inf), so that an error is produced. That seems like it's better left to R Core. The 4th option is like the 2nd - I'd need a call to a separate function listing all of the memory locations, just to ID those that have changed, and then check the values; I'm not even sure this will work for all objects, as a program may do an in-place modification, which seems like it would not invoke the duplicate function.

Is there a better approach that I'm missing? Maybe some clever tool by Mark Bravington, Luke Tierney, or something relatively basic - something akin to an options() parameter or a flag when compiling R?

Example code: Here is some very simple example code to test with, incorporating the addTaskCallback function proposed by Josh O'Brien. Execution isn't interrupted, but an error does occur in the first scenario, while no error occurs in the second (i.e. badDiv(0, 0, FALSE) doesn't abort). I'm still investigating callbacks, as this looks promising.

badDiv  <- function(x, y, flag){
    z = x / y
    if(flag == TRUE){
        return(z)
    } else {
        return(FALSE)
    }
}

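# stopOnNaNs is assumed to be the task callback from Josh O'Brien's answer;
# it is not defined in this post.
addTaskCallback(stopOnNaNs)
badDiv(0, 0, TRUE)

addTaskCallback(stopOnNaNs)
badDiv(0, 0, FALSE)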


Note 1. I'd be satisfied with a solution for standard R operations, though a lot of my calculations involve objects used via data.table or bigmemory (i.e. disk-based memory mapped matrices). These appear to have somewhat different memory behaviors than standard matrix and data.frame operations.

Note 2. The callbacks idea seems a bit more promising, as this doesn't require me to write functions that mutate R code, e.g. via the body() idea.

Note 3. I don't know whether there is some simple way to test for the presence of non-finite values, e.g. meta information about objects that indexes where NAs, Infs, etc. are stored in the object, or whether these are only stored in place. So far, I've tried Simon Urbanek's inspect package, and have not found a way to divine the presence of non-finite values.

Follow-up: Simon Urbanek has pointed out in a comment that such information is not available as meta information for objects.

Note 4. I'm still testing the ideas presented. Also, as suggested by Simon, testing for the presence of non-finite values should be fastest in C/C++; that should surpass even compiled R code, but I'm open to anything. For large datasets, e.g. on the order of 10-50GB, this should be a substantial savings over copying the data. One may get further improvements via use of multiple cores, but that's a bit more advanced.
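As a rough illustration of the C-level scan mentioned in Note 4, the sketch below walks a buffer of doubles once, allocates nothing, and stops at the first non-finite element. The name first_nonfinite is hypothetical (it is not part of R's API), and the code uses the C99 isfinite macro instead of R's R_finite so that it compiles without R headers. R stores NA_real_ as a NaN, so this test catches NA as well.

```c
#include <math.h>     /* isfinite */
#include <stddef.h>   /* size_t */

/* Hypothetical helper: return the index of the first NA/NaN/Inf in
   d[0..n-1], or n if every element is finite. */
size_t first_nonfinite(const double *d, size_t n) {
    for (size_t i = 0; i < n; i++) {
        if (!isfinite(d[i])) return i;   /* early exit on the first bad value */
    }
    return n;
}
```

On the buffer {1.0, 2.0, INFINITY} this returns 2, and on an all-finite buffer of length 3 it returns 3. By contrast, is.finite() in R returns a logical vector as long as its input, which is exactly the kind of extra allocation one would want to avoid on very large objects.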

Solution

I fear there is no such shortcut. In theory, on unix there is SIGFPE that you could trap on, but in practice:

  1. there is no standard way to enable FP operations to trap (even C99 doesn't include a provision for that) - it is highly system-specific (e.g. feenableexcept on Linux, fp_enable_all on AIX etc.) or requires the use of assembler for your target CPU,
  2. FP operations are nowadays often done in vector units like SSE, so you can't even be sure that the FPU is involved, and
  3. R intercepts some operations on things like NaNs and NAs and handles them separately, so they won't make it to the FP code.

That said, you could hack yourself an R that will catch some exceptions for your platform and CPU if you tried hard enough (disable SSE etc.). It is not something we would consider building into R, but for a special purpose it may be doable.

However, it would still not catch NaN/NA operations unless you change R internal code. In addition, you would have to check every single package you are using since they may be using FP operations in their C code and may also handle NA/NaN separately.

If you are only worried about things like division by zero or over/underflows, the above will work and is probably the closest to something like a solution.

Just checking your results may not be very reliable, because you can't tell whether a result was derived from some intermediate NaN that altered an aggregated value without making that value NaN itself. If you are willing to ignore such cases, you could simply walk recursively through your result objects or the workspace. That should not be extremely inefficient, because you only need to worry about REALSXP and nothing else (unless you don't like NAs either, in which case you'd have more work).
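To make the caveat about intermediate values concrete, here is a small sketch in plain C of two ways a non-finite intermediate can disappear before you ever look at the result. Both helper names are hypothetical and exist only for this illustration; they rely on ordinary IEEE 754 arithmetic and nothing R-specific.

```c
#include <math.h>     /* NAN, INFINITY, isfinite */
#include <stddef.h>   /* size_t */

/* An infinite intermediate collapses to a finite value:
   x / y is +Inf for 1/0, and 1/Inf is 0. */
double reciprocal_of_ratio(double x, double y) {
    double ratio = x / y;
    return 1.0 / ratio;
}

/* Every comparison with NaN is false, so NaN elements are silently
   skipped and the count looks perfectly ordinary. */
size_t count_above(const double *v, size_t n, double threshold) {
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (v[i] > threshold) count++;
    }
    return count;
}
```

Here reciprocal_of_ratio(1.0, 0.0) returns 0, and count_above on {1.0, NAN, 2.0} with threshold 0 returns 2. A check applied only to these results would pass, even though an Inf or a NaN took part in the computation.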


Here is some example code that could be used to traverse an R object recursively:

#include <R.h>
#include <Rinternals.h>

/* Returns 1 if every double reachable from x is finite, 0 otherwise. */
static int do_isFinite(SEXP x) {
    /* recurse into generic vectors (lists) */
    if (TYPEOF(x) == VECSXP) {
        R_xlen_t n = XLENGTH(x);   /* XLENGTH also copes with long vectors */
        for (R_xlen_t i = 0; i < n; i++)
            if (!do_isFinite(VECTOR_ELT(x, i))) return 0;
    }
    /* recurse into pairlists */
    if (TYPEOF(x) == LISTSXP) {
        while (x != R_NilValue) {
            if (!do_isFinite(CAR(x))) return 0;
            x = CDR(x);
        }
        return 1;
    }
    /* I wouldn't bother with attributes except for S4
       where attributes are slots */
    if (IS_S4_OBJECT(x) && !do_isFinite(ATTRIB(x))) return 0;
    /* check reals: R_finite is false for NA, NaN and +/-Inf */
    if (TYPEOF(x) == REALSXP) {
        R_xlen_t n = XLENGTH(x);
        double *d = REAL(x);
        for (R_xlen_t i = 0; i < n; i++)
            if (!R_finite(d[i])) return 0;
    }
    return 1;
}

/* .Call entry point: returns a length-one logical */
SEXP isFinite(SEXP x) { return ScalarLogical(do_isFinite(x)); }

# in R: .Call("isFinite", x)
