Simple way to analyze data based on common key

Question

What would be the simplest way to process all the records that were mapped to a specific key and output multiple records for that data?

For example (a synthetic one), assume my key is a date and the values are intra-day timestamps with measured temperatures. I'd like to classify each temperature within the day as high, average, or low (that is, more than 1 stddev above the day's average, within 1 stddev of it, or more than 1 stddev below it).

The output would be the original temperatures with their new classifications.

Using Combine.PerKey(CombineFn) allows only one output per key, produced by the extractOutput() method.

Thanks

Recommended answer

CombineFns are restricted to a single output value because that allows the system to do additional parallelization: it can combine different subsets of the values separately and then combine those intermediate results in an arbitrary tree-reduction pattern until a single result value is produced for each key.
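To illustrate why a single mergeable result makes that parallelization possible, here is a minimal plain-Java sketch. It does not use the Dataflow SDK; the class name TemperatureStats and its methods are hypothetical, chosen only for this example. The idea is that count, sum, and sum of squares can be accumulated over any subset of a day's readings and merged in any order, and the mean and (population) standard deviation are derived only at the end.

```java
import java.util.List;

/**
 * Hypothetical accumulator (not an SDK class): holds count, sum, and sum of
 * squares, which is enough to derive the mean and standard deviation later.
 */
final class TemperatureStats {
    long count;
    double sum;
    double sumOfSquares;

    /** Fold one reading into this partial result. */
    TemperatureStats add(double value) {
        count++;
        sum += value;
        sumOfSquares += value * value;
        return this;
    }

    /** Merge another partial result; grouping and order do not matter. */
    TemperatureStats merge(TemperatureStats other) {
        count += other.count;
        sum += other.sum;
        sumOfSquares += other.sumOfSquares;
        return this;
    }

    double mean() {
        return sum / count;
    }

    /** Population standard deviation, clamped at zero against rounding error. */
    double stddev() {
        double m = mean();
        return Math.sqrt(Math.max(0.0, sumOfSquares / count - m * m));
    }

    /** Accumulate a whole subset of readings into one partial result. */
    static TemperatureStats of(List<Double> values) {
        TemperatureStats stats = new TemperatureStats();
        for (double v : values) {
            stats.add(v);
        }
        return stats;
    }

    public static void main(String[] args) {
        // Two subsets of the same day's readings, combined separately...
        TemperatureStats morning = of(List.of(10.0, 20.0));
        TemperatureStats afternoon = of(List.of(30.0, 40.0));
        // ...then merged into the single per-key result.
        TemperatureStats day = morning.merge(afternoon);
        System.out.println("mean=" + day.mean() + " stddev=" + day.stddev());
    }
}
```

In a real pipeline the same three roles would presumably map onto a CombineFn's accumulator, its merge step, and extractOutput(), but that mapping is an assumption of this sketch rather than something spelled out in the answer.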

If your values per key don't fit in memory (so you can't use the GroupByKey-ParDo pattern that Jeremy suggests) but the computed statistics do fit in memory, you could also do something like this:

(1) Use Combine.perKey() to calculate the stats per day.
(2) Use View.asIterable() to convert those into PCollectionViews.
(3) Reprocess the original input with a ParDo that takes the statistics as side inputs.
(4) In that ParDo's DoFn, have startBundle() take the side inputs and build up an in-memory data structure mapping days to statistics, which can then be used for lookups in processElement.
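The two-pass shape of that approach can be sketched in plain Java, without the Dataflow SDK. Everything named here (DailyClassifier, Reading, Stats, statsPerDay, classify) is hypothetical and only stands in for the pipeline stages: the first pass plays the role of the per-key combine, the resulting map plays the role of the side input, and the second pass plays the role of the ParDo that looks up each element's day. The classification thresholds (more than one population standard deviation above or below the day's mean) follow the question's description.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class DailyClassifier {
    /** Hypothetical input element: one temperature reading keyed by day. */
    static final class Reading {
        final String day;
        final double temperature;
        Reading(String day, double temperature) {
            this.day = day;
            this.temperature = temperature;
        }
    }

    /** Per-day statistics, assumed small enough to hold in memory. */
    static final class Stats {
        final double mean;
        final double stddev;
        Stats(double mean, double stddev) {
            this.mean = mean;
            this.stddev = stddev;
        }
    }

    /** Pass 1: one statistics value per day (stands in for the per-key combine). */
    static Map<String, Stats> statsPerDay(List<Reading> readings) {
        // Accumulator per day: {count, sum, sumOfSquares}.
        Map<String, double[]> acc = new HashMap<>();
        for (Reading r : readings) {
            double[] a = acc.computeIfAbsent(r.day, d -> new double[3]);
            a[0] += 1;
            a[1] += r.temperature;
            a[2] += r.temperature * r.temperature;
        }
        Map<String, Stats> result = new HashMap<>();
        for (Map.Entry<String, double[]> e : acc.entrySet()) {
            double[] a = e.getValue();
            double mean = a[1] / a[0];
            double variance = Math.max(0.0, a[2] / a[0] - mean * mean);
            result.put(e.getKey(), new Stats(mean, Math.sqrt(variance)));
        }
        return result;
    }

    /** Pass 2, per element: look up the day's stats and label the reading. */
    static String classify(Reading r, Map<String, Stats> lookup) {
        Stats s = lookup.get(r.day);
        if (r.temperature > s.mean + s.stddev) {
            return "high";
        }
        if (r.temperature < s.mean - s.stddev) {
            return "low";
        }
        return "average";
    }

    /** Runs both passes and returns one label per input reading, in input order. */
    static List<String> classifyAll(List<Reading> readings) {
        Map<String, Stats> lookup = statsPerDay(readings); // the "side input"
        List<String> labels = new ArrayList<>();
        for (Reading r : readings) {
            labels.add(classify(r, lookup));
        }
        return labels;
    }
}
```

Note that the original readings are read twice, once per pass, which is the price paid for never holding all of one day's values in memory at once.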
