Isolate randomness of a local environment from the global R process

Problem description

We can use set.seed() to set a random seed in R, and this has a global effect. Here is a minimal example to illustrate my goal:

set.seed(0)
runif(1)
# [1] 0.8966972

set.seed(0)
f <- function() {
  # I do not want this random number to be affected by the global seed
  runif(1)
}
f()
# [1] 0.8966972

Basically, I want to be able to avoid the effect of the global random seed (i.e., .Random.seed) in a local environment, such as an R function, so that I can achieve some sort of randomness over which the user has no control. For example, even if the user has called set.seed(), they will still get different output every time they call this function.

Now there are two implementations. The first one relies on set.seed(NULL) to let R re-initialize the random seed every time I want to get some random numbers:

createUniqueId <- function(bytes) {
  withPrivateSeed(
    paste(as.hexmode(sample(256, bytes, replace = TRUE) - 1), collapse = "")
  )
}
withPrivateSeed <- function(expr, seed = NULL) {
  oldSeed <- if (exists('.Random.seed', envir = .GlobalEnv, inherits = FALSE)) {
    get('.Random.seed', envir = .GlobalEnv, inherits = FALSE)
  }
  if (!is.null(oldSeed)) {
    on.exit(assign('.Random.seed', oldSeed, envir = .GlobalEnv), add = TRUE)
  }
  set.seed(seed)
  expr
}

You can see that I get different id strings even if I set the seed to 0, and the global random number stream is still reproducible:

> set.seed(0)
> runif(3)
[1] 0.8966972 0.2655087 0.3721239
> createUniqueId(4)
[1] "83a18600"
> runif(3)
[1] 0.5728534 0.9082078 0.2016819

> set.seed(0)
> runif(3)  # same
[1] 0.8966972 0.2655087 0.3721239
> createUniqueId(4)  # different
[1] "77cb3d91"
> runif(3)
[1] 0.5728534 0.9082078 0.2016819

> set.seed(0)
> runif(3)
[1] 0.8966972 0.2655087 0.3721239
> createUniqueId(4)
[1] "c41d61d8"
> runif(3)
[1] 0.5728534 0.9082078 0.2016819

The second implementation can be found here on GitHub. It is more complicated, and the basic idea is:

  • initialize the random seed during package startup using set.seed(NULL) (in .onLoad())
  • store the random seed in a separate environment (.globals$ownSeed)
  • each time we want to generate random numbers:
  1. assign the local seed to the global random seed
  2. generate the random numbers
  3. assign the new global seed (changed by step 2) to the local seed
  4. restore the global seed to its original value
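
The four steps above can be sketched roughly as follows. This is a minimal illustration, not the code from the linked repository: the helper names get_global_seed, restore_global_seed, init_own_seed and with_own_seed are made up for this example, and the private seed is initialized lazily on first use rather than in .onLoad(). Only .globals$ownSeed comes from the description above.

```r
# Separate environment holding the private seed (name taken from the description above)
.globals <- new.env(parent = emptyenv())
.globals$ownSeed <- NULL

# Hypothetical helper: return the global seed, or NULL if none exists yet
get_global_seed <- function() {
  if (exists(".Random.seed", envir = .GlobalEnv, inherits = FALSE)) {
    get(".Random.seed", envir = .GlobalEnv, inherits = FALSE)
  } else {
    NULL
  }
}

# Hypothetical helper: put the global seed back exactly as it was
restore_global_seed <- function(seed) {
  if (is.null(seed)) {
    if (exists(".Random.seed", envir = .GlobalEnv, inherits = FALSE)) {
      rm(".Random.seed", envir = .GlobalEnv)
    }
  } else {
    assign(".Random.seed", seed, envir = .GlobalEnv)
  }
}

# What a package would do once, e.g. in .onLoad(): create the private seed
init_own_seed <- function() {
  old <- get_global_seed()
  on.exit(restore_global_seed(old), add = TRUE)
  set.seed(NULL)                         # time/pid based re-initialization
  .globals$ownSeed <- get_global_seed()  # keep the resulting state privately
}

with_own_seed <- function(expr) {
  if (is.null(.globals$ownSeed)) init_own_seed()
  old <- get_global_seed()
  on.exit(restore_global_seed(old), add = TRUE)                 # step 4
  assign(".Random.seed", .globals$ownSeed, envir = .GlobalEnv)  # step 1
  result <- expr                         # step 2: forces the lazy argument
  .globals$ownSeed <- get_global_seed()  # step 3
  result
}

# Usage: the global stream is untouched, the private stream keeps advancing
set.seed(0)
runif(1)
with_own_seed(runif(1))
runif(1)
```

Because the private state is carried over from one call to the next, set.seed(NULL) is only needed once in this sketch, so the time-based seed is not recomputed on every call.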

Now my question is whether the two approaches are equivalent in theory. The randomness of the first approach relies on the current time and process ID when createUniqueId() is called, and the second approach seems to rely on the time and process ID when the package is loaded. For the first approach, is it possible that two calls of createUniqueId() happen at exactly the same time in the same R process, so that they return the same id string?

In the answer below, Robert Krzyzanowski provided some empirical evidence that set.seed(NULL) can lead to serious ID collisions. I did a simple visualization for it:

createGlobalUniqueId <- function(bytes) {
  paste(as.hexmode(sample(256, bytes, replace = TRUE) - 1), collapse = "")
}
n <- 10000
length(unique(replicate(n, createGlobalUniqueId(5))))
length(unique(x <- replicate(n, createUniqueId(5))))
# denote duplicated values by 1, and unique ones by 0
png('rng-time.png', width = 4000, height = 400)
par(mar = c(4, 4, .1, .1), xaxs = 'i')
plot(1:n, duplicated(x), type = 'l')
dev.off()

When the line reaches the top of the plot, that means a duplicate value was generated. However, note that these duplicates do not come successively, i.e. any(x[-1] == x[-n]) is normally FALSE. There might be a pattern for the duplication associated with the system time. I'm not able to investigate further due to my lack of understanding of how the time-based random seed works, but you can see the relevant pieces of C source code here and here.

Recommended answer

I thought it would be nice to have an independent RNG inside your function, one that is not affected by the global seed but has its own seed. It turns out that randtoolbox offers this functionality:

library(randtoolbox)
replicate(3, {
  set.seed(1)
  c(runif(1), WELL(3), runif(1))
})   
#            [,1]      [,2]      [,3]
#[1,] 0.265508663 0.2655087 0.2655087
#[2,] 0.481195594 0.3999952 0.9474923
#[3,] 0.003865934 0.6596869 0.4684255
#[4,] 0.484556709 0.9923884 0.1153879
#[5,] 0.372123900 0.3721239 0.3721239

The top and bottom rows are affected by the seed, whereas the middle ones are "truly random".

Based on that, here's the implementation of your function:

sample_WELL <- function(n, size=n) {
  findInterval(WELL(size), 0:n/n)
}

createUniqueId_WELL <- function(bytes) {
  paste(as.hexmode(sample_WELL(256, bytes) - 1), collapse = "")
}

length(unique(replicate(10000, createUniqueId_WELL(5))))
#[1] 10000

# independency on the seed: 
set.seed(1)
x <- replicate(100, createGlobalUniqueId(5))
x_WELL <- replicate(100, createUniqueId_WELL(5))
set.seed(1)
y <- replicate(100, createGlobalUniqueId(5))
y_WELL <- replicate(100, createUniqueId_WELL(5))
sum(x==y)
#[1] 100
sum(x_WELL==y_WELL)
#[1] 0

Edit

To understand why we get duplicated keys, we should take a look at what happens when we call set.seed(NULL). All RNG-related code is written in C, so we should go directly to svn.r-project.org/R/trunk/src/main/RNG.c and refer to the function do_setseed. If seed = NULL, then TimeToSeed is called. There's a comment stating that it should be located in datetime.c; however, it can be found in svn.r-project.org/R/trunk/src/main/times.c.

Navigating the R source can be difficult, so I'm pasting the function here:

/* For RNG.c, main.c, mkdtemp.c */
attribute_hidden
unsigned int TimeToSeed(void)
{
    unsigned int seed, pid = getpid();
#if defined(HAVE_CLOCK_GETTIME) && defined(CLOCK_REALTIME)
    {
    struct timespec tp;
    clock_gettime(CLOCK_REALTIME, &tp);
    seed = (unsigned int)(((uint_least64_t) tp.tv_nsec << 16) ^ tp.tv_sec);
    }
#elif defined(HAVE_GETTIMEOFDAY)
    {
    struct timeval tv;
    gettimeofday (&tv, NULL);
    seed = (unsigned int)(((uint_least64_t) tv.tv_usec << 16) ^ tv.tv_sec);
    }
#else
    /* C89, so must work */
    seed = (Int32) time(NULL);
#endif
    seed ^= (pid <<16);
    return seed;
}

So each time we call set.seed(NULL), R does these steps:

  1. Takes the current time in seconds and nanoseconds (where available; the #if defined blocks handle the platform dependency)
  2. Shifts the nanoseconds left by 16 bits and XORs the result with the seconds
  3. Shifts the pid left by 16 bits and XORs it with the previous result
  4. Sets the result as the new seed
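
These steps can be mirrored in plain R arithmetic. The sketch below is only an approximation of the C function for illustration: xor32 and time_to_seed_sketch are hypothetical helper names, the seconds, nanoseconds and pid are passed in as arguments instead of being read from the clock, and the result is returned as an unsigned 32-bit value stored in a double (C and R may display the same bits as a negative signed integer).

```r
# XOR of two unsigned 32-bit values held in doubles, done in 16-bit halves
# so that bitwXor() only ever sees small non-negative integers
xor32 <- function(a, b) {
  hi <- bitwXor(as.integer(a %/% 2^16), as.integer(b %/% 2^16))
  lo <- bitwXor(as.integer(a %% 2^16), as.integer(b %% 2^16))
  hi * 2^16 + lo
}

time_to_seed_sketch <- function(sec, nsec, pid) {
  shifted_nsec <- (nsec * 2^16) %% 2^32      # step 2: shift, keep low 32 bits
  seed <- xor32(shifted_nsec, sec %% 2^32)   # step 2: XOR with the seconds
  shifted_pid <- (pid * 2^16) %% 2^32        # step 3: shift the pid
  xor32(seed, shifted_pid)                   # step 3: XOR with previous result
}

# Inputs taken from the duplicate example shown later in this answer
time_to_seed_sketch(sec = 1397894128, nsec = 135644672, pid = 2057)
# 2639997936, which is -1654969360 when read as a signed 32-bit integer
```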

Well, now it's almost obvious that we get duplicated values when the resulting seeds collide. My guess is that this happens when two calls fall within 1 second, so that tv_sec is constant. To confirm that, I'm introducing a lag:

createUniqueIdWithLag <- function(bytes, lag) {
  Sys.sleep(lag)
  createUniqueId(bytes)
}
lags <- 1 / 10 ^ (1:5)
sapply(lags, function(x) length(unique(replicate(n, createUniqueIdWithLag(5, x)))))
[1] 1000 1000  996  992  990

What's confusing is that even though the lag is substantial compared to nanoseconds, we still get collisions! Let's dig further: I wrote a "debugging emulator" for the seed:

emulate_seed <- function() {
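  # Caveat: a 19-digit nanosecond timestamp exceeds what a double can store
  # exactly (2^53), so as.numeric() rounds away its lowest bits.
  tv <- as.numeric(system('echo $(($(date +%s%N)))', intern = TRUE))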
  pid <- Sys.getpid()
  tv_nsec <- tv %% 1e9
  tv_sec <- tv %/% 1e9
  seed <- bitwXor(bitwShiftL(tv_nsec, 16), tv_sec)
  seed <- bitwXor(bitwShiftL(pid, 16), seed)
  c(seed, tv_nsec, tv_sec, pid)
}

z <- replicate(1000, emulate_seed())
sapply(1:4, function(i) length(unique(z[i, ])))
# unique seeds, nanosecs, secs, pids:
#[1]  941 1000   36    1

That is really confusing: the nanoseconds are all unique, yet that does not guarantee uniqueness of the final seed. To demonstrate that effect, here's one of the duplicates:

#            [,1]        [,2] 
#[1,] -1654969360 -1654969360
#[2,]   135644672   962643456
#[3,]  1397894128  1397894128 
#[4,]        2057        2057
bitwShiftL(135644672, 16)
#[1] -973078528
bitwShiftL(962643456, 16)
#[1] -973078528

A final note: the binary representation of these two numbers and of the shift is

00001000000101011100011000000000 << 16 => 1100011000000000 + 16 zeroes
00111001011000001100011000000000 << 16 => 1100011000000000 + 16 zeroes

So yes, this is really an unwanted collision.
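
The same collision can be checked with ordinary arithmetic, without relying on how bitwShiftL() handles overflow. In this small sketch, low16 and shift16_trunc32 are hypothetical helper names, and the two inputs are the nanosecond readings from the duplicate shown above.

```r
# Low 16 bits of a non-negative number
low16 <- function(x) x %% 2^16

# Left shift by 16 bits, truncated to an unsigned 32-bit result
shift16_trunc32 <- function(x) (x * 2^16) %% 2^32

nsec <- c(135644672, 962643456)

low16(nsec)            # both readings share the same low 16 bits: 50688
shift16_trunc32(nsec)  # so the shifted values are identical
shift16_trunc32(nsec) - 2^32  # -973078528 as a signed 32-bit integer
```

Since the seed has 32 bits and the nanoseconds are shifted left by 16, only the low 16 bits of the nanosecond reading reach the seed. For a fixed second and a fixed pid, that leaves at most 65536 distinct seeds.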

Well, after all that, the final conclusion is: set.seed(NULL) is vulnerable to high load and does not guarantee the absence of collisions when dealing with multiple consecutive calls!
