转到SDK Apache Beam:用于整体定义的Singleton侧面输入Singleton [英] Go SDK Apache Beam : singleton side input Singleton for int ill-defined

查看:38
本文介绍了转到SDK Apache Beam:用于整体定义的Singleton侧面输入Singleton的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

使用用于Apache Beam的Go SDK,我尝试使用侧面输入创建PCollection的视图.

Using the Go SDK for Apache Beam, I'm trying to create a view of a PCollection using a side input.

但是我遇到了这个奇怪的错误:

But I'm getting this weird error:

Failed to execute job: on ctx=      making side input 0:
singleton side input Singleton for int ill-defined
exit status 1

这是我正在使用的代码:

Here the code I'm using:

// A PCollection of key/value pairs
pairedWithOne := beam.ParDo(s, func(r models.Review) (string, int) {
        return r.DoRecommend, 1
    }, col)

// A PCollection of ints (demo)
pcollInts := beam.CreateList(s, [3]int{
        1, 2, 3,
})

// A PCollection of key/values pairs
summed := stats.SumPerKey(s, pairedWithOne)

// Here is where I'd like to use my side input.
mapped := beam.ParDo(s, func(k string, v int, side int, emit func(ratio 
models.RecommendRatio)) {
        var ratio = models.RecommendRatio{
            DoRecommend: k,
            NumVotes:    v,
        }

        emit(ratio)
    }, summed, beam.SideInput{Input: pcollInts})

我在 git :

// Side Inputs
//
// While a ParDo processes elements from a single "main input" PCollection, it
// can take additional "side input" PCollections. These SideInput along with
// the DoFn parameter form express styles of accessing PCollection computed by
// earlier pipeline operations, passed in to the ParDo transform using SideInput
// options, and their contents accessible to each of the DoFn operations. For
// example:
//
//     words := ...
//     cufoff := ...  // Singleton PCollection<int>
//     smallWords := beam.ParDo(s, func (word string, cutoff int, emit func(string)) {
//           if len(word) < cutoff {
//                emit(word)
//           }
//     }, words, beam.SideInput{Input: cutoff})

更新:似乎 Impulse(scope)函数在这里起作用,但我不知道是什么.从GoDoc:

update: It seems like the Impulse(scope) function has a role here but I cannot figure what. From GoDoc :

Impulse emits a single empty []byte into the global window. The resulting PCollection is a singleton of type []byte.

The purpose of Impulse is to trigger another transform, such as ones that take all information as side inputs.

如果这可以帮助,请在这里查看我的结构:

If this can help, here my structs:

type Review struct {
    Date        time.Time `csv:"date" json:"date"`
    DoRecommend string    `csv:"doRecommend" json:"doRecommend"`
    NumHelpful  int       `csv:"numHelpful" json:"numHelpful"`
    Rating      int       `csv:"rating" json:"rating"`
    Text        string    `csv:"text" json:"text"`
    Title       string    `csv:"title" json:"title"`
    Username    string    `csv:"username" json:"username"`
}

type RecommendRatio struct {
    DoRecommend string `json:"doRecommend"`
    NumVotes    int    `json:"numVotes"`
}

有什么解决办法吗?

谢谢

推荐答案

更新:

这可以通过删除 beam.Impulse()函数来简化(我认为错误的类型在这里引起了麻烦):

This can be simplified by removing the beam.Impulse() function (I think the wrong type caused the trouble here):

mapped := beam.ParDo(s,
        func(k string, v int,
            sideCounted int,
            emit func(ratio models.RecommendRatio)) {

            p := percent.PercentOf(v, sideCounted)

            emit(models.RecommendRatio{
                DoRecommend: k,
                NumVotes:    v,
                Percent:     p,
            })

        }, summed,
        beam.SideInput{Input: counted})

旧:似乎我已经找到了解决方案,也许只是一种解决方法,寻求快速审核并为改进留有余地.(我认为该函数不是幂等的,因为如果它可能在多个节点工作程序上执行多次,则append()函数将复制条目...)

Old: Seems like I've found a solution, maybe just a workaround, looking for a quick review and open to room for improvements. (I believe that function isnt idempotent because if it may executed more than once on multiple node workers, the append() function will duplicate entries...)

但是这里的全局想法是使用 beam.Impulse(scope)函数创建 [] uint8字节的单例PCollection,并传递所有真实"数据作为辅助输入.

But the global idea here is to make a singleton PCollection of a []uint8 byte using beam.Impulse(scope) function and pass all the "real" data as a side inputs.

    // Pair each recommendation value with one -> PColl<KV<string, int>>
    pairedWithOne := beam.ParDo(s, func(r models.Review) (string, int) {
        return r.DoRecommend, 1
    }, col)

    // Sum num occurrences of a recommendation k/v pair
    summed := stats.SumPerKey(s, pairedWithOne)

    // Drop keys for latter global count
    droppedKey := beam.DropKey(s, pairedWithOne)

    // Count globally the number of recommendation values -> PColl<int>
    counted := stats.Sum(s, droppedKey)

    // Map to a struct with percentage per ratio
    mapped := beam.ParDo(s,
        func(_ []uint8,
            sideSummed func(k *string, v *int) bool,
            sideCounted int,
            emit func(ratio []models.RecommendRatio)) {

            var k string
            var v int
            var ratios []models.RecommendRatio


            for sideSummed(&k, &v) {
                p := percent.PercentOf(v, sideCounted)

                ratio := models.RecommendRatio{
                    DoRecommend: k,
                    NumVotes:    v,
                    Percent:     p,
                }

                ratios = append(ratios, ratio)
            }

            emit(ratios)

        }, beam.Impulse(s),
        beam.SideInput{Input: summed},
        beam.SideInput{Input: counted})

这篇关于转到SDK Apache Beam:用于整体定义的Singleton侧面输入Singleton的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆