parallel parLapply setup
Question
I am trying to use part-of-speech tagging from the openNLP/NLP packages in parallel. I need the code to work on any OS, so I am opting to use the parLapply function from the parallel package (but am open to other OS-independent options). In the past I ran the tagPOS function from the openNLP package in parLapply with no problem. However, the openNLP package recently had some changes that eliminated tagPOS and added some more flexible options. Kurt was kind enough to help me recreate the tagPOS function from the new package's tools. I can get the lapply version to work, but not the parallel version. It keeps saying the nodes need more variables passed to them, until it finally asks for a non-exported function from openNLP. It seems odd that it keeps asking for more and more variables to be passed, which tells me I'm setting up parLapply incorrectly. How can I set up tagPOS to operate in a parallel, OS-independent fashion?
library(openNLP)
library(NLP)
library(parallel)
## POS tagger
tagPOS <- function(x, pos_tag_annotator, ...) {
    s <- as.String(x)
    ## Need sentence and word token annotations.
    word_token_annotator <- Maxent_Word_Token_Annotator()
    a2 <- Annotation(1L, "sentence", 1L, nchar(s))
    a2 <- annotate(s, word_token_annotator, a2)
    a3 <- annotate(s, pos_tag_annotator, a2)
    ## Determine the distribution of POS tags for word tokens.
    a3w <- a3[a3$type == "word"]
    POStags <- unlist(lapply(a3w$features, `[[`, "POS"))
    ## Extract token/POS pairs (all of them): easy.
    POStagged <- paste(sprintf("%s/%s", s[a3w], POStags), collapse = " ")
    list(POStagged = POStagged, POStags = POStags)
} ## End of tagPOS function
## Set up a parallel run
text.var <- c("I like it.", "This is outstanding soup!",
              "I really must get the recipe.")
ntv <- length(text.var)
PTA <- Maxent_POS_Tag_Annotator()

cl <- makeCluster(mc <- getOption("cl.cores", detectCores()/2))
clusterExport(cl = cl, varlist = c("text.var", "ntv", "tagPOS", "PTA",
                                   "as.String", "Maxent_Word_Token_Annotator"),
              envir = environment())
m <- parLapply(cl, seq_len(ntv), function(i) {
    x <- tagPOS(text.var[i], PTA)
    return(x)
})
stopCluster(cl)
## Error in checkForRemoteErrors(val) :
## 3 nodes produced errors; first error: could not find function
## "Maxent_Simple_Word_Tokenizer"
openNLP::Maxent_Simple_Word_Tokenizer
## >openNLP::Maxent_Simple_Word_Tokenizer
## Error: 'Maxent_Simple_Word_Tokenizer' is not an exported
## object from 'namespace:openNLP'
## It's a non exported function
openNLP:::Maxent_Simple_Word_Tokenizer
## Demo that it works with lapply
lapply(seq_len(ntv), function(i) {
    tagPOS(text.var[i], PTA)
})
lapply(text.var, function(x) {
    tagPOS(x, PTA)
})
## > lapply(seq_len(ntv), function(i) {
## + tagPOS(text.var[i], PTA)
## + })
## [[1]]
## [[1]]$POStagged
## [1] "I/PRP like/IN it/PRP ./."
##
## [[1]]$POStags
## [1] "PRP" "IN" "PRP" "."
##
## [[1]]$word.count
## [1] 3
##
##
## [[2]]
## [[2]]$POStagged
## [1] "THis/DT is/VBZ outstanding/JJ soup/NN !/."
##
## [[2]]$POStags
## [1] "DT" "VBZ" "JJ" "NN" "."
##
## [[2]]$word.count
## [1] 4
##
##
## [[3]]
## [[3]]$POStagged
## [1] "I/PRP really/RB must/MD get/VB the/DT recip/NN ./."
##
## [[3]]$POStags
## [1] "PRP" "RB" "MD" "VB" "DT" "NN" "."
##
## [[3]]$word.count
## [1] 6
Edit (per Steve's suggestion)

Note that openNLP is brand new. I installed version 0.2-1 from a tar.gz from CRAN. I get the following error even though this function exists.
library(openNLP); library(NLP); library(parallel)
tagPOS <- function(text.var, pos_tag_annotator, ...) {
    s <- as.String(text.var)
    ## Set up the POS annotator if missing (for parallel)
    if (missing(pos_tag_annotator)) {
        pos_tag_annotator <- Maxent_POS_Tag_Annotator()
    }
    ## Need sentence and word token annotations.
    word_token_annotator <- Maxent_Word_Token_Annotator()
    a2 <- Annotation(1L, "sentence", 1L, nchar(s))
    a2 <- annotate(s, word_token_annotator, a2)
    a3 <- annotate(s, pos_tag_annotator, a2)
    ## Determine the distribution of POS tags for word tokens.
    a3w <- a3[a3$type == "word"]
    POStags <- unlist(lapply(a3w$features, `[[`, "POS"))
    ## Extract token/POS pairs (all of them): easy.
    POStagged <- paste(sprintf("%s/%s", s[a3w], POStags), collapse = " ")
    list(POStagged = POStagged, POStags = POStags)
}
text.var <- c("I like it.", "This is outstanding soup!",
              "I really must get the recipe.")
cl <- makeCluster(mc <- getOption("cl.cores", detectCores()/2))
clusterEvalQ(cl, {library(openNLP); library(NLP)})
m <- parLapply(cl, text.var, tagPOS)
## > m <- parLapply(cl, text.var, tagPOS)
## Error in checkForRemoteErrors(val) :
## 3 nodes produced errors; first error: could not find function "Maxent_POS_Tag_Annotator"
stopCluster(cl)
> packageDescription('openNLP')
Package: openNLP
Encoding: UTF-8
Version: 0.2-1
Title: Apache OpenNLP Tools Interface
Authors@R: person("Kurt", "Hornik", role = c("aut", "cre"), email =
"Kurt.Hornik@R-project.org")
Description: An interface to the Apache OpenNLP tools (version 1.5.3). The Apache OpenNLP
library is a machine learning based toolkit for the processing of natural language
text written in Java. It supports the most common NLP tasks, such as tokenization,
sentence segmentation, part-of-speech tagging, named entity extraction, chunking,
parsing, and coreference resolution. See http://opennlp.apache.org/ for more
information.
Imports: NLP (>= 0.1-0), openNLPdata (>= 1.5.3-1), rJava (>= 0.6-3)
SystemRequirements: Java (>= 5.0)
License: GPL-3
Packaged: 2013-08-20 13:23:54 UTC; hornik
Author: Kurt Hornik [aut, cre]
Maintainer: Kurt Hornik <Kurt.Hornik@R-project.org>
NeedsCompilation: no
Repository: CRAN
Date/Publication: 2013-08-20 15:41:22
Built: R 3.0.1; ; 2013-08-20 13:48:47 UTC; windows
Answer
Since you're calling functions from NLP on the cluster workers, you should load it on each of the workers before calling parLapply. You can do that from the worker function, but I tend to use clusterCall or clusterEvalQ right after creating the cluster object:
clusterEvalQ(cl, {library(openNLP); library(NLP)})
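For reference, the clusterCall form mentioned above works the same way for this purpose. A minimal sketch, assuming openNLP and NLP are installed on every node:

```r
library(parallel)

cl <- makeCluster(2)

## clusterCall() runs the given function once on each worker; here it
## attaches the packages so their functions are visible in every
## worker session, just like the clusterEvalQ() call above.
clusterCall(cl, function() {
    library(openNLP)  # assumes openNLP/NLP are installed on every node
    library(NLP)
    NULL              # avoid shipping package environments back to the master
})

stopCluster(cl)
```

The `NULL` at the end is a small courtesy: clusterCall returns each worker's result to the master, and returning the value of `library()` would serialize more than is needed.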
Since as.String and Maxent_Word_Token_Annotator are in those packages, they shouldn't need to be exported with clusterExport.
Note that while running your example on my machine, I noticed that the PTA object doesn't work after being exported to the worker machines. Presumably there is something in that object that can't be safely serialized and unserialized. After I created that object on the workers using clusterEvalQ, the example ran successfully. Here it is, using openNLP 0.2-1:
library(parallel)
tagPOS <- function(x, ...) {
    s <- as.String(x)
    word_token_annotator <- Maxent_Word_Token_Annotator()
    a2 <- Annotation(1L, "sentence", 1L, nchar(s))
    a2 <- annotate(s, word_token_annotator, a2)
    a3 <- annotate(s, PTA, a2)
    a3w <- a3[a3$type == "word"]
    POStags <- unlist(lapply(a3w$features, `[[`, "POS"))
    POStagged <- paste(sprintf("%s/%s", s[a3w], POStags), collapse = " ")
    list(POStagged = POStagged, POStags = POStags)
}
text.var <- c("I like it.", "This is outstanding soup!",
              "I really must get the recipe.")
cl <- makeCluster(mc <- getOption("cl.cores", detectCores()/2))
clusterEvalQ(cl, {
    library(openNLP)
    library(NLP)
    PTA <- Maxent_POS_Tag_Annotator()
})
m <- parLapply(cl, text.var, tagPOS)
print(m)
stopCluster(cl)
If clusterEvalQ fails because Maxent_POS_Tag_Annotator is not found, you might be loading the wrong version of openNLP on the workers. You can determine what package versions you're getting on the workers by executing sessionInfo with clusterEvalQ:
library(parallel)
cl <- makeCluster(2)
clusterEvalQ(cl, {library(openNLP); library(NLP)})
clusterEvalQ(cl, sessionInfo())
This will return the results of executing sessionInfo() on each of the cluster workers. Here is the version information for some of the packages that I'm using and that work for me:
other attached packages:
[1] NLP_0.1-0 openNLP_0.2-1
loaded via a namespace (and not attached):
[1] openNLPdata_1.5.3-1 rJava_0.9-4
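One common reason for workers picking up a stale openNLP is that their library search path differs from the master's. A quick way to check and align it, sketched with only base R's parallel package (no openNLP required):

```r
library(parallel)

cl <- makeCluster(2)

## Ask each worker for its library search path and compare with the
## master's; a mismatch means the workers may load a different (older)
## installation of openNLP.
master_paths <- .libPaths()
worker_paths <- clusterCall(cl, .libPaths)
same <- all(vapply(worker_paths, identical, logical(1), master_paths))

if (!same) {
    ## Point the workers at the master's library tree before loading
    ## any packages there.
    clusterCall(cl, .libPaths, master_paths)
}

stopCluster(cl)
```

Run this check before the clusterEvalQ that loads openNLP, so that the version reported by sessionInfo() on the workers matches the one on the master.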