How to aggregate data into ranges (bucketize)?
Question
I have a table like
+---+-----+
| id|value|
+---+-----+
|  1|118.0|
|  2|109.0|
|  3|113.0|
|  4| 82.0|
|  5| 60.0|
|  6|111.0|
|  7|107.0|
|  8| 84.0|
|  9| 91.0|
| 10|118.0|
+---+-----+
and would like to aggregate or bin the values into the ranges 0, 10, 20, 30, 40, ..., 80, 90, 100, 110, 120.
How can I perform this in SQL, or more specifically in Spark SQL?
Currently I have a lateral view join with the range, but this seems rather clumsy/inefficient.
Quantile discretization is not really what I want; rather, a CUT with this specified range.
https://github.com/collectivemedia/spark-ext/blob/master/sparkext-mllib/src/main/scala/org/apache/spark/ml/feature/Binning.scala would perform dynamic binning, but I need this specified range instead.
Answer
Try GROUP BY with this:

SELECT id, CAST(value / 10 AS INT) * 10 AS bucket FROM table_name;
The following would be the equivalent using the Dataset API in Scala (requires import spark.implicits._ for the $ column syntax):

df.select(($"value" / 10).cast("int") * 10)
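To make the bucketing arithmetic concrete, here is a minimal pure-Scala sketch that applies the same expression (divide by 10, truncate to int, multiply by 10) to the sample values and counts them per bucket, the way a GROUP BY would. The object and function names are illustrative, and no SparkSession is needed:

```scala
// Illustrative sketch of the bucketing arithmetic from the answer above.
object BucketizeSketch {
  // Map a value to the lower edge of its 10-wide bucket, e.g. 118.0 -> 110.
  def bucket(v: Double): Int = (v / 10).toInt * 10

  def main(args: Array[String]): Unit = {
    // The sample values from the question's table.
    val values = Seq(118.0, 109.0, 113.0, 82.0, 60.0, 111.0, 107.0, 84.0, 91.0, 118.0)
    // Count values per bucket, mirroring GROUP BY bucket in SQL.
    val counts = values.groupBy(bucket).map { case (b, vs) => b -> vs.size }
    counts.toSeq.sortBy(_._1).foreach { case (b, n) => println(s"$b: $n") }
  }
}
```

Note that truncation toward zero only matches the intended bins for non-negative values; negative values would need floor instead.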