Two-dimensional aggregation in Histogrammar
Question
In the examples I found, binning is only performed on a 1D array of data. I would like to bin 2D data in order to simulate the groupby/aggregation of SQL. Is that possible using Histogrammar?
(Question reposted from Michel Page.)
Answer
Yes, it is possible to aggregate 2D data by nesting 1D aggregators. A simple example is a 2D histogram:
hist2d = Bin(numX, lowX, highX, lambda event: event.x,
             Bin(numY, lowY, highY, lambda event: event.y))
(Python syntax; substitute lambda-functions for Scala, etc.). The first Bin aggregator partitions data by event.x and passes it on to the second, which happens to be another Bin instead of the default Count.
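As a plain-Python sketch (not the Histogrammar API) of what the nesting does, the outer binning picks a row and hands the event to an independent inner binning for that row:

```python
# Hypothetical sketch, NOT the Histogrammar API: nested 1D binning
# emulating a 2D histogram. Each outer bin owns its own inner histogram.

def bin_index(value, num, low, high):
    """Map a value to a bin index in [0, num), or None if out of range."""
    if not (low <= value < high):
        return None
    return int((value - low) * num / (high - low))

def hist2d(events, num_x, low_x, high_x, num_y, low_y, high_y):
    """Bin (x, y) pairs into a num_x-by-num_y grid of counts."""
    grid = [[0] * num_y for _ in range(num_x)]
    for x, y in events:
        i = bin_index(x, num_x, low_x, high_x)   # outer aggregator: choose row
        j = bin_index(y, num_y, low_y, high_y)   # inner aggregator: choose column
        if i is not None and j is not None:
            grid[i][j] += 1                      # innermost default: Count
    return grid
```

The same shape generalizes: replacing the innermost count with any other accumulator gives other nested aggregations.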
But you say "simulating groupBy/aggregation of SQL." It's possible to GROUP BY an integer bin number to use an SQL query as a histogram, in which case the Histogrammar example is just a much easier way to do it. However, when people GROUP BY in SQL, they are usually grouping by some categorical data, such as a string.
In Histogrammar, that would be
groupedHists = Categorize(lambda event: event.category,
                          Bin(num, low, high, lambda event: event.numerical))
Here, Categorize takes the place of Bin to make a new sub-aggregator for each unique string.
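A plain-Python sketch (again, not the Histogrammar API) of that Categorize-over-Bin pattern is a dictionary that lazily creates one histogram per unique key:

```python
# Hypothetical sketch, NOT the Histogrammar API: one 1D histogram per
# unique category string, created on first sight of the key.
from collections import defaultdict

def bin_index(value, num, low, high):
    """Map a value to a bin index in [0, num), or None if out of range."""
    if not (low <= value < high):
        return None
    return int((value - low) * num / (high - low))

def grouped_hists(events, num, low, high):
    """events: (category, numerical) pairs -> {category: list of bin counts}."""
    hists = defaultdict(lambda: [0] * num)   # new sub-aggregator per key
    for category, value in events:
        i = bin_index(value, num, low, high)
        if i is not None:
            hists[category][i] += 1
    return dict(hists)
```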
Finally, if you're working with an exceedingly large number of categories, you probably want to use the underlying system's (e.g. Spark's) map-reduce functionality to do the aggregation-by-key. If Histogrammar does it, Spark would randomly send data to N workers, each collecting data for all categories, which are then laboriously merged. If Spark does it, Spark will send all data for a given category to the same worker, using less memory overall and making the merging easier.
Here's an efficient version of groupedHists in Spark (Scala):
val groupedHists =
  rdd.map(event => (event.category, event))
     .aggregateByKey(Bin(num, low, high, {event: Event => event.numerical}))
                    (new Increment, new Combine)
     .collect
This will give you (String, Histogram) pairs, rather than a combined Categorical-Binned histogram as above, but it's the same information.
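The two functions aggregateByKey needs can be sketched in plain Python (hypothetical names, standing in for the Increment and Combine above): one folds a single event into a per-key histogram, the other merges two partial histograms built on different workers:

```python
# Hypothetical sketch of aggregateByKey's two operations, assuming a
# histogram is a plain list of bin counts over [low, high).

def increment(hist, value, low=0.0, high=2.0):
    """Fold one numerical value into a histogram (the per-worker step)."""
    num = len(hist)
    if low <= value < high:
        hist[int((value - low) * num / (high - low))] += 1
    return hist

def combine(left, right):
    """Merge two partial histograms with identical binning (the shuffle step)."""
    return [a + b for a, b in zip(left, right)]
```

Because combine is associative, partial histograms for the same key can be merged in any order, which is what lets Spark build each key's histogram where its data lives.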