ducttape sometimes-skip task: cross-product error
Question
I'm trying a variant of sometimes-skip tasks for ducttape, based on the tutorial here: http://nschneid.github.io/ducttape-crash-course/tutorial5.html
([ducttape][1] is a Bash/Scala based workflow management tool.)
I'm trying to do a cross-product to execute task1 on "clean" data and "dirty" data. The idea is to traverse the same path, but without preprocessing in some cases. To do this, I need to do a cross-product of tasks.
task cleanup < in=(Dirty: a=data/a b=data/b) > out {
prefix=$(cat $in)
echo "$prefix-clean" > $out
}
global {
data=(Data: dirty=(Dirty: a=data/a b=data/b) clean=(Clean: a=$out@cleanup b=$out@cleanup))
}
task task1 < in=$data > out
{
cat $in > $out
}
plan FinalTasks {
reach task1 via (Dirty: *) * (Data: *) * (Clean: *)
}
Here is the execution plan. I would expect 6 tasks, but I have two duplicate tasks being executed.
$ ducttape skip.tape
ducttape 0.3
by Jonathan Clark
Loading workflow version history...
Have 7 previous workflow versions
Finding hyperpaths contained in plan...
Found 8 vertices implied by realization plan FinalTasks
Union of all planned vertices has size 8
Checking for completed tasks from versions 1 through 7...
Finding packages...
Found 0 packages
Checking for already built packages (if this takes a long time, consider switching to a local-disk git clone instead of a remote repository)...
Checking inputs...
Work plan (depth-first traversal):
RUN: /nfsmnt/hltfs0/data/nicruiz/slt/IWSLT13/analysis/workflow/tmp/./cleanup/Baseline.baseline (Dirty.a)
RUN: /nfsmnt/hltfs0/data/nicruiz/slt/IWSLT13/analysis/workflow/tmp/./cleanup/Dirty.b (Dirty.b)
RUN: /nfsmnt/hltfs0/data/nicruiz/slt/IWSLT13/analysis/workflow/tmp/./task1/Baseline.baseline (Data.dirty+Dirty.a)
RUN: /nfsmnt/hltfs0/data/nicruiz/slt/IWSLT13/analysis/workflow/tmp/./task1/Dirty.b (Data.dirty+Dirty.b)
RUN: /nfsmnt/hltfs0/data/nicruiz/slt/IWSLT13/analysis/workflow/tmp/./task1/Clean.b+Data.clean+Dirty.b (Clean.b+Data.clean+Dirty.b)
RUN: /nfsmnt/hltfs0/data/nicruiz/slt/IWSLT13/analysis/workflow/tmp/./task1/Data.clean+Dirty.b (Clean.a+Data.clean+Dirty.b)
RUN: /nfsmnt/hltfs0/data/nicruiz/slt/IWSLT13/analysis/workflow/tmp/./task1/Data.clean (Clean.a+Data.clean+Dirty.a)
RUN: /nfsmnt/hltfs0/data/nicruiz/slt/IWSLT13/analysis/workflow/tmp/./task1/Clean.b+Data.clean (Clean.b+Data.clean+Dirty.a)
Are you sure you want to run these 8 tasks? [y/n]
Removing the symlinks from the output below, my duplicates are here:
$ head task1/*/out
==> Baseline.baseline/out <==
1
==> Clean.b+Data.clean/out <==
1-clean
==> Data.clean/out <==
1-clean
==> Clean.b+Data.clean+Dirty.b/out <==
2-clean
==> Data.clean+Dirty.b/out <==
2-clean
==> Dirty.b/out <==
2
Could someone with ducttape experience help me find my cross-product problem?
[1]: https://github.com/jhclark/ducttape
Recommended answer
So why do we have 4 realizations involving the branch point Clean at task1 instead of just two?
The answer is that in ducttape, branch points are always propagated through all transitive dependencies of a task. So the branch point "Dirty" from the task "cleanup" is propagated through clean=(Clean: a=$out@cleanup b=$out@cleanup). At this point the variable "clean" contains the cross-product of the original "Dirty" branch point and the newly introduced "Clean" branch point: 2 x 2 = 4 realizations. Both branches of "Clean" point at the same $out@cleanup, so for each "Dirty" branch the two "Clean" realizations read identical data, which is where the duplicates come from.
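The counting can be checked with a small Python sketch. This is only a simplified model of the cross-product described above, not ducttape's actual planning algorithm; the variable names are made up for illustration, and the branch names are taken from the question's workflow.

```python
from itertools import product

# Branch points and their branches, as named in the question's workflow.
dirty = ["a", "b"]
clean = ["a", "b"]

# cleanup depends only on the Dirty branch point.
cleanup = [{"Dirty": d} for d in dirty]

# Data.dirty reads the raw input: one task1 realization per Dirty branch.
task1_dirty = [{"Data": "dirty", "Dirty": d} for d in dirty]

# Data.clean reads $out@cleanup through a new Clean branch point, so the
# inherited Dirty branch point is crossed with Clean.
task1_clean = [
    {"Data": "clean", "Clean": c, "Dirty": d} for c, d in product(clean, dirty)
]

task1 = task1_dirty + task1_clean
print(len(cleanup), len(task1), len(cleanup) + len(task1))  # 2 6 8

# Group the clean realizations by the Dirty branch that actually picks the data.
by_dirty = {}
for r in task1_clean:
    by_dirty.setdefault(r["Dirty"], []).append(r["Clean"])
print(by_dirty)  # {'a': ['a', 'b'], 'b': ['a', 'b']}
```

The total of 8 matches the "Found 8 vertices" line in the question's plan, and each Dirty branch ends up with two Clean realizations that read the same cleanup output.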
The minimal change to make is to change
clean=(Clean: a=$out@cleanup b=$out@cleanup)
to
clean=$out@cleanup
This would give you the desired number of realizations, but it's a bit confusing to use the branch point name "Dirty" just to control which input data set you're using -- with only this minimal change, the two realizations of the task "cleanup" would be (Dirty: a b).
It may make your workflow even more grokkable to refactor it like this:
global {
raw_data=(DataSet: a=data/a b=data/b)
}
task cleanup < in=$raw_data > out {
prefix=$(cat $in)
echo "$prefix-clean" > $out
}
global {
ready_data=(DoCleanup: no=$raw_data yes=$out@cleanup)
}
task task1 < in=$ready_data > out
{
cat $in > $out
}
plan FinalTasks {
reach task1 via (DataSet: *) * (DoCleanup: *)
}
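As a rough sanity check of the refactored plan, the same kind of Python model can enumerate its realizations. Again this is a sketch under the assumption that the realizations are the plain cross-product of the two branch points; the raw file contents "1" and "2" are taken from the question's head output, and simulated_output is a hypothetical helper, not part of ducttape.

```python
from itertools import product

# Branch points of the refactored workflow.
data_set = ["a", "b"]
do_cleanup = ["no", "yes"]

# cleanup only sees the DataSet branch point.
cleanup = [{"DataSet": d} for d in data_set]

# task1 sees DataSet crossed with DoCleanup.
task1 = [{"DataSet": d, "DoCleanup": c} for d, c in product(data_set, do_cleanup)]


def simulated_output(realization):
    """Hypothetical stand-in for what task1 would write to $out."""
    raw = {"a": "1", "b": "2"}[realization["DataSet"]]
    if realization["DoCleanup"] == "yes":
        return raw + "-clean"
    return raw


outputs = [simulated_output(r) for r in task1]
print(len(cleanup) + len(task1))  # 6
print(outputs)  # ['1', '1-clean', '2', '2-clean']
```

Under this model the plan has 6 tasks in total, and every task1 realization produces a distinct output.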