"foreach" parallel loop returns <NA>s


Problem description

I am trying to process several list items in parallel.

My goal is to run a labeling function on every column, based on its values, and then return a data frame with the node name, column name, and the processed label.

The workflow works fine using a normal for loop. However, when I try to do the same thing in a foreach loop, every value in the returned results is <NA>. (Please note: the following is just an abstraction of the original dataset.)

I am not sure what exactly is getting messed up in between. If you can help me sort this out, that would be awesome :-)

set.seed(12345)
options(stringsAsFactors = F)


# I. Random data generation (Original data is in data frame format)
random.data = list()
random.data[["one"]] = as.data.frame(matrix(data = runif(n = 15), ncol = 3))
random.data[["two"]] = as.data.frame(matrix(data = runif(n = 15), ncol = 3))
random.data[["three"]] = as.data.frame(matrix(data = runif(n = 15), ncol = 3))



# II. Some function applied to each column to label/classify the values
valslabel = function(DataCOlumn) {
  if(mean(DataCOlumn) < 0.5) return("low")
  return("high")
}



# III. Generating the desired output in a regular for loop : 

desiredOutput = list()

for(frame.i in seq_along(random.data)) {

  frame = random.data[[frame.i]]
  frame.name = names(random.data)[frame.i]
  frame.results = data.frame(frame.name = character(0), 
                  mappedField = character(0), label = character(0) )

  for(col.i in 1:ncol(frame)) {
    frame.results[col.i, "frame.name"] = frame.name
    frame.results[col.i, "mappedField"] = colnames(frame)[col.i]
    frame.results[col.i, "label"] = valslabel(frame[,col.i])  
  }

  desiredOutput[[frame.name]] = frame.results
}


print(desiredOutput)

# $one
# frame.name mappedField label
# 1        one          V1  high
# 2        one          V2  high
# 3        one          V3   low
# 
# $two
# frame.name mappedField label
# 1        two          V1   low
# 2        two          V2  high
# 3        two          V3   low
# 
# $three
# frame.name mappedField label
# 1      three          V1   low
# 2      three          V2  high
# 3      three          V3  high




# IV. Using the "foreach" parallel execution

library(foreach)
library(doParallel)

cl = makeCluster(6)
registerDoParallel(cl)

output = foreach(frame.i = seq_along(random.data), .verbose = T) %dopar% {

  frame = random.data[[frame.i]]
  frame.name = names(random.data)[frame.i]
  frame.results = data.frame(frame.name = character(0), mappedField = character(0), label = character(0) )

  for(col.i in 1:ncol(frame)) {
    frame.results[col.i, "frame.name"] = frame.name
    frame.results[col.i, "mappedField"] = colnames(frame)[col.i]
    frame.results[col.i, "label"] = valslabel(frame[,col.i])  
  }

  return(frame.results)
}


print(output)

# [[1]]
# frame.name mappedField label
# 1       <NA>        <NA>  <NA>
# 2       <NA>        <NA>  <NA>
# 3       <NA>        <NA>  <NA>
#   
# [[2]]
# frame.name mappedField label
# 1       <NA>        <NA>  <NA>
# 2       <NA>        <NA>  <NA>
# 3       <NA>        <NA>  <NA>
#   
# [[3]]
# frame.name mappedField label
# 1       <NA>        <NA>  <NA>
# 2       <NA>        <NA>  <NA>
# 3       <NA>        <NA>  <NA>

Thanks!

Solution

The problem is related to the way you initialise your data frame, combined with the fact that on the foreach workers the option stringsAsFactors is not set to FALSE. What happens in each foreach iteration is something like this:

options(stringsAsFactors = TRUE)
d <- data.frame(x = character(0))
d[1, "x"] <- "a"
#Warning message:
#In `[<-.factor`(`*tmp*`, iseq, value = "a") :
#  invalid factor level, NA generated
d
#     x
#1 <NA>

Note that this only gives a warning, not an error, so the loop doesn't stop. If you set stringsAsFactors to FALSE first, there is no problem (which is what you did when not running in parallel):

options(stringsAsFactors = FALSE)
d <- data.frame(x = character(0))
d[1, "x"] <- "a"
d
#  x
#1 a
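The same contrast can be shown without touching the session-wide option, by passing stringsAsFactors directly to data.frame(). This is only a sketch: the variable names d_factor and d_char are made up for the illustration. It may also be worth knowing that from R 4.0.0 onwards the default for stringsAsFactors is FALSE, so the behaviour described here mainly affects older R versions.

```r
# Column created as a factor with no levels, as happens when the
# stringsAsFactors option is TRUE
d_factor <- data.frame(x = character(0), stringsAsFactors = TRUE)
# This assignment warns "invalid factor level, NA generated";
# the warning is suppressed here only to keep the output quiet
suppressWarnings(d_factor[1, "x"] <- "a")

# Column created as character, whatever the session-wide option says
d_char <- data.frame(x = character(0), stringsAsFactors = FALSE)
d_char[1, "x"] <- "a"

print(d_factor)  # the single value is NA
print(d_char)    # the single value is "a"
```

Spelling out the argument in the data.frame() call makes the code independent of whichever options happen to be set in the process that runs it.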

In your global environment you had already set options(stringsAsFactors = FALSE), so the regular for loop (like a %do% loop) worked. However, this option is not passed on to the separate R session of each parallel worker, so the %dopar% loop runs into the problem above.
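The general point, that options set in the main session do not reach the workers of a socket cluster, can be checked with the base parallel package alone. The sketch below uses a made-up option name, demo.flag, so that the result does not depend on the R version. It assumes the default socket cluster created by makeCluster(); forked workers would inherit the parent's options and behave differently.

```r
library(parallel)

cl <- makeCluster(2)  # two workers, an arbitrary number for this sketch

# "demo.flag" is a hypothetical option name used only for illustration
options(demo.flag = "set on master")

# Each worker is a separate R process, so it has never seen the option
before <- clusterEvalQ(cl, getOption("demo.flag"))

# Running options() on every worker sets it there as well
invisible(clusterEvalQ(cl, options(demo.flag = "set on worker")))
after <- clusterEvalQ(cl, getOption("demo.flag"))

stopCluster(cl)

print(before)  # one entry per worker, taken before the option was set there
print(after)   # one entry per worker, taken afterwards
```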

Look, for example, at the output of the following:

options(stringsAsFactors = FALSE)
.Options$stringsAsFactors
#[1] FALSE
foreach(i = 1:3) %dopar% .Options$stringsAsFactors
#[[1]]
#[1] TRUE
#
#[[2]]
#[1] TRUE
#
#[[3]]
#[1] TRUE

So the solution is to set options(stringsAsFactors = FALSE) inside the foreach loop.
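A minimal sketch of that fix, assuming the foreach and doParallel packages are installed. It rebuilds the question's data so that it runs on its own, uses two workers (an arbitrary choice), and names the result list afterwards, which the question's code did not do.

```r
library(foreach)
library(doParallel)

set.seed(12345)
random.data <- list(
  one   = as.data.frame(matrix(runif(15), ncol = 3)),
  two   = as.data.frame(matrix(runif(15), ncol = 3)),
  three = as.data.frame(matrix(runif(15), ncol = 3))
)

# Label a column "low" or "high" depending on its mean
valslabel <- function(DataColumn) {
  if (mean(DataColumn) < 0.5) "low" else "high"
}

cl <- makeCluster(2)
registerDoParallel(cl)

output <- foreach(frame.i = seq_along(random.data)) %dopar% {
  # Set the option on the worker before any data frame is created
  options(stringsAsFactors = FALSE)

  frame <- random.data[[frame.i]]
  frame.name <- names(random.data)[frame.i]
  frame.results <- data.frame(frame.name = character(0),
                              mappedField = character(0),
                              label = character(0))

  for (col.i in seq_len(ncol(frame))) {
    frame.results[col.i, "frame.name"] <- frame.name
    frame.results[col.i, "mappedField"] <- colnames(frame)[col.i]
    frame.results[col.i, "label"] <- valslabel(frame[, col.i])
  }

  frame.results
}

stopCluster(cl)
names(output) <- names(random.data)
print(output)
```

Two alternatives that should have the same effect: call clusterEvalQ(cl, options(stringsAsFactors = FALSE)) once after makeCluster(), or pass stringsAsFactors = FALSE to the data.frame() call itself so the code no longer depends on the option.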

As an aside, it is much better to build your data frame from whole column vectors rather than row by row when possible. In your example you can replace

frame.results = data.frame(frame.name = character(0), mappedField = character(0), label = character(0))
for(col.i in 1:ncol(frame)) {
    frame.results[col.i, "frame.name"] = frame.name
    frame.results[col.i, "mappedField"] = colnames(frame)[col.i]
    frame.results[col.i, "label"] = valslabel(frame[,col.i])  
}

with

frame.results <- data.frame( 
    frame.name = frame.name, 
    mappedField = colnames(frame), 
    label = valslabel1(colMeans(frame)))

where the valslabel function has been replaced by a vectorised version

valslabel1 <- function(x) {
    ifelse(x < 0.5, "low", "high")
}
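Put together, the vectorised version can be run over all three frames in plain base R. The sketch below reuses the question's data. The unname() call and the explicit stringsAsFactors = FALSE are additions that are not in the answer: colMeans() returns a named vector, and dropping the names keeps them from being picked up as row names. The list name vectorised is likewise made up for the example.

```r
set.seed(12345)
random.data <- list(
  one   = as.data.frame(matrix(runif(15), ncol = 3)),
  two   = as.data.frame(matrix(runif(15), ncol = 3)),
  three = as.data.frame(matrix(runif(15), ncol = 3))
)

# Original scalar version: one label per column
valslabel <- function(DataColumn) {
  if (mean(DataColumn) < 0.5) "low" else "high"
}

# Vectorised version: one label per element of x
valslabel1 <- function(x) {
  ifelse(x < 0.5, "low", "high")
}

# Build each result frame from whole column vectors in a single call
vectorised <- lapply(names(random.data), function(frame.name) {
  frame <- random.data[[frame.name]]
  data.frame(
    frame.name = frame.name,
    mappedField = colnames(frame),
    label = valslabel1(unname(colMeans(frame))),
    stringsAsFactors = FALSE
  )
})
names(vectorised) <- names(random.data)

print(vectorised)
```

Because no empty data frame is filled in row by row, this version never assigns a string into a factor column, so it avoids the NA problem as well.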
