Two-dimensional aggregation in Histogrammar

Problem description

In the examples I found, binning is only performed on a 1D array of data. I would like to bin 2D data in order to simulate the groupby/aggregation of SQL. Is that possible using Histogrammar?

(Question reposted from Michel Page.)

Answer

Yes, it is possible to aggregate 2D data by nesting 1D aggregators. A simple example is a 2D histogram:

hist2d = Bin(numX, lowX, highX, lambda event: event.x,
             Bin(numY, lowY, highY, lambda event: event.y))

(Python syntax; substitute the equivalent anonymous-function syntax for Scala, etc.) The first Bin aggregator partitions the data by event.x and passes it on to the second, which happens to be another Bin instead of the default Count.
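To make the nested structure concrete, here is a minimal fill-and-inspect sketch; the Event record, bin ranges, and sample values are invented for illustration, and reading results back through values and entries reflects the Python implementation's accessors:

from collections import namedtuple
from histogrammar import Bin

# Hypothetical event records for illustration.
Event = namedtuple("Event", ["x", "y"])

numX, lowX, highX = 10, 0.0, 10.0
numY, lowY, highY = 5, 0.0, 5.0

hist2d = Bin(numX, lowX, highX, lambda event: event.x,
             Bin(numY, lowY, highY, lambda event: event.y))

# Each fill increments exactly one (x, y) cell.
for event in [Event(1.5, 2.5), Event(1.7, 2.2), Event(8.0, 4.9)]:
    hist2d.fill(event)

# Each x bin holds a full y histogram; the innermost Count holds the entries.
print(hist2d.values[1].values[2].entries)   # 2.0: the two events near (1.6, 2.3)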

But you say "simulating the groupBy/aggregation of SQL." It is possible to GROUP BY an integer bin number to use an SQL query as a histogram, in which case the Histogrammar example above is just a much easier way to do the same thing. However, when people GROUP BY in SQL, they are usually grouping by some categorical data, such as a string.

In Histogrammar, that would be:

groupedHists = Categorize(lambda event: event.category,
                          Bin(num, low, high, lambda event: event.numerical))

Here, Categorize takes the place of Bin to make a new sub-aggregator for each unique string it encounters.
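As a sketch of how this behaves when filled (again with invented Event data; treating bins as the category-to-histogram dict is an assumption about the Python implementation):

from collections import namedtuple
from histogrammar import Bin, Categorize

# Hypothetical event records for illustration.
Event = namedtuple("Event", ["category", "numerical"])

num, low, high = 10, 0.0, 100.0

groupedHists = Categorize(lambda event: event.category,
                          Bin(num, low, high, lambda event: event.numerical))

for event in [Event("a", 12.0), Event("b", 55.0), Event("a", 17.0)]:
    groupedHists.fill(event)

# One sub-histogram per unique category string, like SQL's GROUP BY.
# (Assumes .bins exposes the category -> histogram map.)
for category, hist in groupedHists.bins.items():
    print(category, [v.entries for v in hist.values])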

Finally, if you're working with an exceedingly large number of categories, you probably want to use the underlying system's (e.g. Spark's) map-reduce functionality to do the aggregation by key. If Histogrammar does it, Spark would randomly send data to N workers, each collecting data for all categories, and those partial results would then have to be laboriously merged. If Spark does it, it will send all data for a given category to the same worker, using less memory overall and making the merging easier.

Here's an efficient version of groupedHists in Spark (Scala):

// Key each event by its category, then fill one histogram per key.
val groupedHists =
  rdd.map(event => (event.category, event))
     .aggregateByKey(Bin(num, low, high, {event: Event => event.numerical}))(
       new Increment, new Combine)
     .collect

This will give you (String, Histogram) pairs, rather than a combined categorical-binned histogram as above, but it's the same information.
