"foreach" parallel loop returns <NA>s
I am trying to process several list items in parallel.
My goal is to run a labeling function on every column, based on its values, and then return a data frame with the node name, the column name, and the processed label.
The workflow works fine using a normal for loop. However, when I try to do the same thing in a foreach loop, the returned results are all <NA>. (Please note: the following is just an abstraction of the original dataset.)
I am not sure what exactly is going wrong in between. If you can help me sort it out, that would be awesome :-)
set.seed(12345)
options(stringsAsFactors = F)
# I. Random data generation (Original data is in data frame format)
random.data = list()
random.data[["one"]] = as.data.frame(matrix(data = runif(n = 15), ncol = 3))
random.data[["two"]] = as.data.frame(matrix(data = runif(n = 15), ncol = 3))
random.data[["three"]] = as.data.frame(matrix(data = runif(n = 15), ncol = 3))
# II. Some function applied to each column to label/classify the values
valslabel = function(DataCOlumn) {
if(mean(DataCOlumn) < 0.5) return("low")
return("high")
}
# III. Generating the desired output in a regular for loop :
desiredOutput = list()
for(frame.i in seq_along(random.data)) {
  frame = random.data[[frame.i]]
  frame.name = names(random.data)[frame.i]
  frame.results = data.frame(frame.name = character(0),
                             mappedField = character(0), label = character(0))
  for(col.i in 1:ncol(frame)) {
    frame.results[col.i, "frame.name"] = frame.name
    frame.results[col.i, "mappedField"] = colnames(frame)[col.i]
    frame.results[col.i, "label"] = valslabel(frame[,col.i])
  }
  desiredOutput[[frame.name]] = frame.results
}
print(desiredOutput)
# $one
# frame.name mappedField label
# 1 one V1 high
# 2 one V2 high
# 3 one V3 low
#
# $two
# frame.name mappedField label
# 1 two V1 low
# 2 two V2 high
# 3 two V3 low
#
# $three
# frame.name mappedField label
# 1 three V1 low
# 2 three V2 high
# 3 three V3 high
# IV. Using the "foreach" parallel execution
library(foreach)
library(doParallel)
cl = makeCluster(6)
registerDoParallel(cl)
output = foreach(frame.i = seq_along(random.data), .verbose = TRUE) %dopar% {
  frame = random.data[[frame.i]]
  frame.name = names(random.data)[frame.i]
  frame.results = data.frame(frame.name = character(0), mappedField = character(0), label = character(0))
  for(col.i in 1:ncol(frame)) {
    frame.results[col.i, "frame.name"] = frame.name
    frame.results[col.i, "mappedField"] = colnames(frame)[col.i]
    frame.results[col.i, "label"] = valslabel(frame[,col.i])
  }
  return(frame.results)
}
print(output)
# [[1]]
# frame.name mappedField label
# 1 <NA> <NA> <NA>
# 2 <NA> <NA> <NA>
# 3 <NA> <NA> <NA>
#
# [[2]]
# frame.name mappedField label
# 1 <NA> <NA> <NA>
# 2 <NA> <NA> <NA>
# 3 <NA> <NA> <NA>
#
# [[3]]
# frame.name mappedField label
# 1 <NA> <NA> <NA>
# 2 <NA> <NA> <NA>
# 3 <NA> <NA> <NA>
Thanks!
The problem is related to the way you initialise your data frame, and to the fact that within the foreach environment the option stringsAsFactors is not set to FALSE. What happens in each foreach iteration is something like this:
options(stringsAsFactors = TRUE)
d <- data.frame(x =character(0))
d[1, "x"] <- "a"
#Warning message:
#In `[<-.factor`(`*tmp*`, iseq, value = "a") :
# invalid factor level, NA generated
d
# x
#1 <NA>
Note that this only gives a warning, not an error, so the loop doesn't stop. If you set stringsAsFactors to FALSE first, there is no problem (as was the case when you were not running in parallel):
options(stringsAsFactors = FALSE)
d <- data.frame(x =character(0))
d[1, "x"] <- "a"
d
# x
#1 a
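The two cases above can be reproduced without touching the global option at all by passing stringsAsFactors directly to data.frame(). The following is only a sketch: the names d_factor and d_char are made up for illustration. Passing the argument explicitly also makes the code behave the same before and after R 4.0.0, where the default changed to FALSE.

```r
# Factor column: the empty factor has no levels, so assigning "a" yields NA
# (R emits an "invalid factor level" warning, suppressed here).
d_factor <- data.frame(x = character(0), stringsAsFactors = TRUE)
suppressWarnings(d_factor[1, "x"] <- "a")

# Character column: the assignment simply stores the string.
d_char <- data.frame(x = character(0), stringsAsFactors = FALSE)
d_char[1, "x"] <- "a"

is.na(d_factor$x)   # the factor column holds NA
d_char$x            # the character column holds "a"
```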
In your global environment you already set options(stringsAsFactors = FALSE), so the %do% loop worked. However, this option is not passed on to the local environment of each parallel job, so the %dopar% loop runs into the problem above.
Look, for example, at the output of the following:
options(stringsAsFactors = FALSE)
.Options$stringsAsFactors
#[1] FALSE
foreach(i = 1:3) %dopar% .Options$stringsAsFactors
#[[1]]
#[1] TRUE
#
#[[2]]
#[1] TRUE
#
#[[3]]
#[1] TRUE
So the solution is to set the option stringsAsFactors = FALSE inside the foreach loop.
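Applied to the loop from the question, the fix might look like the sketch below. It assumes the foreach and doParallel packages are installed and uses a two-worker cluster purely for illustration; the only substantive change is the options() call at the top of the loop body.

```r
library(foreach)
library(doParallel)

set.seed(12345)
random.data <- list(
  one   = as.data.frame(matrix(runif(15), ncol = 3)),
  two   = as.data.frame(matrix(runif(15), ncol = 3)),
  three = as.data.frame(matrix(runif(15), ncol = 3))
)

valslabel <- function(DataColumn) {
  if (mean(DataColumn) < 0.5) return("low")
  "high"
}

cl <- makeCluster(2)   # worker count chosen arbitrarily for this sketch
registerDoParallel(cl)

output <- foreach(frame.i = seq_along(random.data)) %dopar% {
  # Set the option on the worker itself; the value set in the
  # calling session is not carried over to the worker processes.
  options(stringsAsFactors = FALSE)
  frame <- random.data[[frame.i]]
  frame.name <- names(random.data)[frame.i]
  frame.results <- data.frame(frame.name = character(0),
                              mappedField = character(0),
                              label = character(0))
  for (col.i in seq_len(ncol(frame))) {
    frame.results[col.i, "frame.name"] <- frame.name
    frame.results[col.i, "mappedField"] <- colnames(frame)[col.i]
    frame.results[col.i, "label"] <- valslabel(frame[, col.i])
  }
  frame.results
}

stopCluster(cl)
names(output) <- names(random.data)
```

An alternative that should have the same effect is to set the option once per worker before the loop with parallel::clusterEvalQ(cl, options(stringsAsFactors = FALSE)), or to pass stringsAsFactors = FALSE to data.frame() directly.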
As an aside, it is much better to build your data frame from whole column vectors rather than row by row whenever possible. In your example you can replace
frame.results = data.frame(frame.name = character(0), mappedField = character(0), label = character(0))
for(col.i in 1:ncol(frame)) {
frame.results[col.i, "frame.name"] = frame.name
frame.results[col.i, "mappedField"] = colnames(frame)[col.i]
frame.results[col.i, "label"] = valslabel(frame[,col.i])
}
with
frame.results <- data.frame(
frame.name = frame.name,
mappedField = colnames(frame),
label = valslabel1(colMeans(frame)))
where the valslabel function has been replaced by a vectorised version:
valslabel1 <- function(x) {
ifelse(x < 0.5, "low", "high")
}
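To check that the vectorised construction agrees with the original column-by-column labelling, a small self-contained sketch for a single frame could look like this. The variable loop.labels, the unname() call (used only to drop the column names that ifelse() carries over), and the explicit stringsAsFactors = FALSE are additions for illustration, not part of the answer above.

```r
set.seed(12345)
frame <- as.data.frame(matrix(runif(15), ncol = 3))
frame.name <- "one"

# Vectorised labelling: works on a whole vector of column means at once.
valslabel1 <- function(x) {
  ifelse(x < 0.5, "low", "high")
}

frame.results <- data.frame(
  frame.name  = frame.name,            # length-1 value is recycled to every row
  mappedField = colnames(frame),
  label       = unname(valslabel1(colMeans(frame))),
  stringsAsFactors = FALSE             # keep character columns regardless of options
)

# Reference result computed one column at a time, as in the original loop.
loop.labels <- vapply(
  frame,
  function(col) if (mean(col) < 0.5) "low" else "high",
  character(1),
  USE.NAMES = FALSE
)

identical(frame.results$label, loop.labels)
```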