How to aggregate data into ranges (bucketize)?
Question
I have a table:
+---+-----+
| id|value|
+---+-----+
|  1|118.0|
|  2|109.0|
|  3|113.0|
|  4| 82.0|
|  5| 60.0|
|  6|111.0|
|  7|107.0|
|  8| 84.0|
|  9| 91.0|
| 10|118.0|
+---+-----+
and would like to aggregate or bin the values into the ranges 0, 10, 20, 30, 40, ..., 80, 90, 100, 110, 120.
How can I perform this in SQL, or more specifically in Spark SQL?
Currently I have a lateral view join with the range, but this seems rather clumsy/inefficient.
Quantile discretization is not really what I want; rather, I want a CUT with this specified range.
https://github.com/collectivemedia/spark-ext/blob/master/sparkext-mllib/src/main/scala/org/apache/spark/ml/feature/Binning.scala would perform dynamic bins, but I need this specified range instead.
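For comparison, Spark ML also ships a built-in Bucketizer that takes an explicit splits array rather than computing bins dynamically, which is close to a CUT with fixed ranges. A minimal sketch, assuming df is the table above with a DoubleType value column and that the output column name bucket is a placeholder:

import org.apache.spark.ml.feature.Bucketizer

// Splits 0, 10, 20, ..., 120 define twelve buckets; bucket i covers [splits(i), splits(i+1)).
val splits = (0 to 120 by 10).map(_.toDouble).toArray

val bucketizer = new Bucketizer()
  .setInputCol("value")
  .setOutputCol("bucket")
  .setSplits(splits)

// Adds a "bucket" column holding the bucket index (0.0, 1.0, ...);
// multiply the index by 10 to recover the lower bound of the range.
val bucketed = bucketizer.transform(df)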
Answer
Try GROUP BY with this (note that Spark SQL's DIV operator does not accept DOUBLE operands, so divide with / and truncate with a CAST instead):
SELECT id, CAST(value / 10 AS INT) * 10 AS bucket FROM table_name;
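For example, to count rows per bucket, the same expression can be repeated in the GROUP BY clause. A sketch, assuming df holds the table above and that table_name, bucket, and cnt are placeholder names:

df.createOrReplaceTempView("table_name")

spark.sql("""
  SELECT CAST(value / 10 AS INT) * 10 AS bucket, COUNT(*) AS cnt
  FROM table_name
  GROUP BY CAST(value / 10 AS INT) * 10
  ORDER BY bucket
""").show()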
The following would be the equivalent using the Dataset API in Scala:
df.select((($"value" / 10).cast("int") * 10).as("bucket"))  // with import spark.implicits._ in scope
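Put together, a minimal end-to-end sketch (the session setup and the bucket column name are assumptions, not part of the original answer) that rebuilds the sample table and counts rows per bucket:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("bucketize")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Rebuild the sample table from the question.
val df = Seq(
  (1, 118.0), (2, 109.0), (3, 113.0), (4, 82.0), (5, 60.0),
  (6, 111.0), (7, 107.0), (8, 84.0), (9, 91.0), (10, 118.0)
).toDF("id", "value")

// Truncate each value down to the start of its 10-wide bucket, then count per bucket.
df.groupBy((($"value" / 10).cast("int") * 10).as("bucket"))
  .count()
  .orderBy("bucket")
  .show()

This yields one row per 10-wide bucket, e.g. 110 for all values in [110, 120).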