Scalding: How to retain the other field, after a groupBy('field){.size}?

Question

My input data has two fields (columns), id1 and id2, and my code is the following:

TextLine(args("input"))
  .read
  .mapTo('line -> ('id1, 'id2)) { line: String =>
    val fields = line.split("\t")
    (fields(0), fields(1))
  }
  .groupBy('id2) { _.size }
  .write(Tsv(args("output")))

The output has (I assume) two fields: id2 and size. I'm stuck on finding out whether it is possible to retain the id1 values that were grouped under each id2 and add them as another field.

Recommended answer

I'm afraid you can't do this in a nice way. Think about how it works under the hood: the data to be counted is split into chunks and sent to different processes, each process counts its own chunk, and a reducer then adds the partial counts together at the end. While a process is counting, it doesn't know the total size of the group, so it can't attach that field to its rows. The only way is to go back and add the size to the data once the total is known (i.e. a join).
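The reasoning above can be sketched with plain Scala collections, without Scalding. This is only an illustration of the three phases (per-chunk counting, merging, joining back); the object name `PartialCounts`, its helper functions, and the sample rows are assumptions made for this example and do not come from the question.

```scala
object PartialCounts {
  // Hypothetical sample rows of (id1, id2); not taken from the question.
  val rows: List[(String, String)] =
    List(("a", "x"), ("b", "x"), ("c", "y"), ("d", "x"))

  // Phase 1 (map side): each chunk counts only its own rows per id2.
  def partialCounts(chunk: List[(String, String)]): Map[String, Int] =
    chunk.groupBy(_._2).map { case (id2, group) => id2 -> group.size }

  // Phase 2 (reduce side): partial counts for the same id2 are summed.
  def mergeCounts(parts: List[Map[String, Int]]): Map[String, Int] =
    parts.flatten.groupBy(_._1).map { case (id2, kvs) => id2 -> kvs.map(_._2).sum }

  // Phase 3 (join): only now can the total be attached to every original row.
  def attachSize(
      rows: List[(String, String)],
      totals: Map[String, Int]
  ): List[(String, String, Int)] =
    rows.map { case (id1, id2) => (id1, id2, totals(id2)) }

  def main(args: Array[String]): Unit = {
    // Pretend each chunk of two rows is handled by a different process.
    val chunks = rows.grouped(2).toList
    val totals = mergeCounts(chunks.map(partialCounts))
    attachSize(rows, totals).foreach(println)
  }
}
```

With these sample rows, the first chunk only sees two rows for "x" and the second chunk only one, so neither chunk could have emitted the correct size of 3 by itself; the total exists only after the merge, which is why the final step has to revisit the original rows.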

If each group fits in memory (and you can configure how much memory is available), you can:

Tsv(args("input"), ('id1, 'id2))
  // For each id2: count the rows and collect every (id1, id2) pair into 'list.
  .groupBy('id2)(_.size.toList[(String, String)](('id1, 'id2) -> 'list))
  // Emit one row per collected pair, each carrying the size of its group.
  .flatMapTo[(Iterable[(String, String)], Int), (String, String, Int)](('list, 'size) -> ('id1, 'id2, 'size)) {
    case (list, size) => list.map(record => (record._1, record._2, size))
  }
  .write(Tsv(args("output")))

But if your system doesn't have enough memory to hold a whole group, you will have to use an expensive join.
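The answer does not show the join itself. A minimal sketch of what it could look like with Scalding's Fields API follows, assuming the same tab-separated input as above. The class name `RetainFieldWithJoin` and the temporary field name `'id2Key` are assumptions made for this example; the key is renamed on one side because the two sides of a join need distinct field names.

```scala
import com.twitter.scalding._

// Hypothetical job name; run it like any other Scalding Job with --input and --output.
class RetainFieldWithJoin(args: Args) extends Job(args) {
  val records = Tsv(args("input"), ('id1, 'id2)).read

  // Count the rows per id2, then rename the key so it does not clash in the join.
  val counts = records
    .groupBy('id2) { _.size }
    .rename('id2 -> 'id2Key)

  // Join the per-group size back onto every original row.
  records
    .joinWithSmaller('id2 -> 'id2Key, counts)
    .project(('id1, 'id2, 'size))
    .write(Tsv(args("output")))
}
```

`joinWithSmaller` is used here on the assumption that the table of counts (one row per distinct id2) is smaller than the full input. Each output row then holds id1, id2, and the size of its id2 group, at the cost of reading the input for both the count and the join.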

Remark: You can read the input with Tsv directly instead of using TextLine followed by mapTo and a manual split.
