ducttape sometimes-skip task: cross-product error

Question

I'm trying a variant of sometimes-skip tasks for ducttape, based on the tutorial here: http://nschneid.github.io/ducttape-crash-course/tutorial5.html

([ducttape][1] is a Bash/Scala based workflow management tool.)

I'm trying to do a cross-product to execute task1 on "clean" data and "dirty" data. The idea is to traverse the same path, but without preprocessing in some cases. To do this, I need to do a cross-product of tasks.

task cleanup < in=(Dirty: a=data/a b=data/b) > out {
    prefix=$(cat $in)
    echo "$prefix-clean" > $out
}

global {
    data=(Data: dirty=(Dirty: a=data/a b=data/b) clean=(Clean: a=$out@cleanup b=$out@cleanup))
}

task task1 < in=$data > out 
{ 
    cat $in > $out
}

plan FinalTasks {
    reach task1 via (Dirty: *) * (Data: *) * (Clean: *)
}

Here is the execution plan. I would expect 6 tasks, but I have two duplicate tasks being executed.

$ ducttape skip.tape
ducttape 0.3
by Jonathan Clark
Loading workflow version history...
Have 7 previous workflow versions
Finding hyperpaths contained in plan...
Found 8 vertices implied by realization plan FinalTasks
Union of all planned vertices has size 8
Checking for completed tasks from versions 1 through 7...
Finding packages...
Found 0 packages
Checking for already built packages (if this takes a long time, consider switching to a local-disk git clone instead of a remote repository)...
Checking inputs...
Work plan (depth-first traversal):
RUN: /nfsmnt/hltfs0/data/nicruiz/slt/IWSLT13/analysis/workflow/tmp/./cleanup/Baseline.baseline (Dirty.a)
RUN: /nfsmnt/hltfs0/data/nicruiz/slt/IWSLT13/analysis/workflow/tmp/./cleanup/Dirty.b (Dirty.b)
RUN: /nfsmnt/hltfs0/data/nicruiz/slt/IWSLT13/analysis/workflow/tmp/./task1/Baseline.baseline (Data.dirty+Dirty.a)
RUN: /nfsmnt/hltfs0/data/nicruiz/slt/IWSLT13/analysis/workflow/tmp/./task1/Dirty.b (Data.dirty+Dirty.b)
RUN: /nfsmnt/hltfs0/data/nicruiz/slt/IWSLT13/analysis/workflow/tmp/./task1/Clean.b+Data.clean+Dirty.b (Clean.b+Data.clean+Dirty.b)
RUN: /nfsmnt/hltfs0/data/nicruiz/slt/IWSLT13/analysis/workflow/tmp/./task1/Data.clean+Dirty.b (Clean.a+Data.clean+Dirty.b)
RUN: /nfsmnt/hltfs0/data/nicruiz/slt/IWSLT13/analysis/workflow/tmp/./task1/Data.clean (Clean.a+Data.clean+Dirty.a)
RUN: /nfsmnt/hltfs0/data/nicruiz/slt/IWSLT13/analysis/workflow/tmp/./task1/Clean.b+Data.clean (Clean.b+Data.clean+Dirty.a)
Are you sure you want to run these 8 tasks? [y/n] 

Removing the symlinks from the output below, my duplicates are here:

$ head task1/*/out
==> Baseline.baseline/out <==
1

==> Clean.b+Data.clean/out <==
1-clean
==> Data.clean/out <==
1-clean

==> Clean.b+Data.clean+Dirty.b/out <==
2-clean
==> Data.clean+Dirty.b/out <==
2-clean

==> Dirty.b/out <==
2

Could someone with experience with ducttape assist me in finding my cross-product problem?

  [1]: https://github.com/jhclark/ducttape

Recommended answer

So why do we have 4 realizations involving the branch point Clean at task1 instead of just two?

The answer is that in ducttape, branch points are always propagated through all transitive dependencies of a task. So the branch point "Dirty" from the task "cleanup" is propagated through clean=(Clean: a=$out@cleanup b=$out@cleanup). At this point the variable "clean" contains the cross product of the original "Dirty" branch point and the newly introduced "Clean" branch point: 2 x 2 = 4 realizations, where you wanted 2.

The minimal change to make is to change

clean=(Clean: a=$out@cleanup b=$out@cleanup)

to

clean=$out@cleanup

This would give you the desired number of realizations, but it's a bit confusing to use the branch point name "Dirty" just to control which input data set you're using -- with only this minimal change, the two realizations of the task "cleanup" would be (Dirty: a b).
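The propagation described above can be checked with a small counting model. This is only a sketch in Python, not ducttape's actual algorithm: it assumes a realization can be represented as a set of (branch point, branch) pairs, and that a branch point merges whatever realizations its branches already carry. The helper `branch_point` and the name `LITERAL` are hypothetical and exist only for this illustration.

```python
# Toy model only (an assumption, not ducttape's implementation):
# a realization is a frozenset of (branch_point, branch) pairs, and every
# value in the workflow is the set of realizations it can take.

def branch_point(name, branches):
    """Tag each upstream realization with the branch that selected it."""
    result = set()
    for branch, upstream in branches.items():
        for realization in upstream:
            result.add(realization | {(name, branch)})
    return result

# A plain file path such as data/a carries no branch points of its own.
LITERAL = {frozenset()}

# task cleanup < in=(Dirty: a=data/a b=data/b) > out
cleanup_out = branch_point("Dirty", {"a": LITERAL, "b": LITERAL})

# dirty=(Dirty: a=data/a b=data/b)
dirty = branch_point("Dirty", {"a": LITERAL, "b": LITERAL})

# Original: clean=(Clean: a=$out@cleanup b=$out@cleanup)
# Each Clean branch inherits both Dirty realizations of cleanup.
clean = branch_point("Clean", {"a": cleanup_out, "b": cleanup_out})
data = branch_point("Data", {"dirty": dirty, "clean": clean})

# Minimal change: clean=$out@cleanup (no extra branch point)
fixed_data = branch_point("Data", {"dirty": dirty, "clean": cleanup_out})

print(len(cleanup_out))  # 2 realizations of cleanup
print(len(clean))        # 4 = Dirty x Clean
print(len(data))         # 6 realizations of task1 in the original workflow
print(len(fixed_data))   # 4 realizations of task1 after the minimal change
```

Under this model the original workflow has 2 + 6 = 8 tasks, matching the work plan in the question, while the minimal change yields 2 + 4 = 6, the number the asker expected.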

It may make your workflow even more grokkable to refactor it like this:

global {
    raw_data=(DataSet: a=data/a b=data/b)
}

task cleanup < in=$raw_data > out {
    prefix=$(cat $in)
    echo "$prefix-clean" > $out
}

global {
    ready_data=(DoCleanup: no=$raw_data yes=$out@cleanup)
}

task task1 < in=$ready_data > out 
{ 
    cat $in > $out
}

plan FinalTasks {
    reach task1 via (DataSet: *) * (DoCleanup: *)
}
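To see which realizations the refactored plan asks for, the cross product in the `reach ... via` line can be enumerated by hand. The following is a sketch in Python under the assumption that each task gets one realization per combination of the branch points it depends on. The labels it prints are illustrative only and are not meant to reproduce ducttape's directory naming.

```python
from itertools import product

# Branch points taken from the refactored workflow above.
branch_points = {
    "DataSet": ["a", "b"],
    "DoCleanup": ["no", "yes"],
}

# cleanup depends only on $raw_data, so it only sees DataSet.
cleanup = [{"DataSet": ds} for ds in branch_points["DataSet"]]

# task1 depends on $ready_data, which carries both DataSet and DoCleanup.
task1 = [
    dict(zip(branch_points, combo))
    for combo in product(*branch_points.values())
]

def label(realization):
    """Illustrative label, e.g. 'DataSet.a+DoCleanup.no'."""
    return "+".join(f"{bp}.{b}" for bp, b in sorted(realization.items()))

for realization in task1:
    print(label(realization))

print(len(cleanup) + len(task1))  # 6 tasks in total
```

Each branch point now has a single job: DataSet selects the input file and DoCleanup selects whether cleanup runs, so the plan is a plain 2 x 2 product with no duplicated outputs.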
