How to aggregate data into ranges (bucketize)?


Problem description

I have a table like

+---+-----+
| id|value|
+---+-----+
|  1|118.0|
|  2|109.0|
|  3|113.0|
|  4| 82.0|
|  5| 60.0|
|  6|111.0|
|  7|107.0|
|  8| 84.0|
|  9| 91.0|
| 10|118.0|
+---+-----+

and would like to aggregate or bin the values into the ranges 0, 10, 20, 30, 40, ..., 80, 90, 100, 110, 120. How can I perform this in SQL, or more specifically Spark SQL?

Currently I have a lateral view join with the range, but this seems rather clumsy / inefficient.

Quantile discretization is not really what I want; rather, a CUT with this fixed range.

https://github.com/collectivemedia/spark-ext/blob/master/sparkext-mllib/src/main/scala/org/apache/spark/ml/feature/Binning.scala would perform dynamic bins, but I need this specified range instead.
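
For the fixed ranges described above, Spark ML's built-in Bucketizer also accepts an explicit splits array instead of deriving bins from quantiles. A minimal sketch, assuming the DataFrame is named df and the numeric column is named value (both names are only illustrative):

import org.apache.spark.ml.feature.Bucketizer

// Explicit bin edges 0, 10, 20, ..., 120 (must be strictly increasing).
val splits = (0 to 120 by 10).map(_.toDouble).toArray

val bucketizer = new Bucketizer()
  .setInputCol("value")      // numeric column to bin
  .setOutputCol("bucket")    // bucket index 0.0, 1.0, ... per row
  .setSplits(splits)

val bucketed = bucketizer.transform(df)

The output column holds the bucket index, so multiplying it by 10 recovers the lower edge of each range.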

Answer

Try "GROUP BY" with this:

SELECT id, (value DIV 10) * 10 FROM table_name;
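
A minimal end-to-end sketch of that idea, assuming an active SparkSession named spark, that the data is registered as a temporary view called table_name, and that a row count per bucket is the desired aggregate (FLOOR is used here instead of DIV so the double-typed value column can be bucketed directly):

// Sketch only: assumes SparkSession `spark` and a temp view named table_name.
val bucketCounts = spark.sql("""
  SELECT FLOOR(value / 10) * 10 AS bucket,  -- lower edge of the 10-wide bin
         COUNT(*)              AS cnt
  FROM table_name
  GROUP BY FLOOR(value / 10) * 10
  ORDER BY bucket
""")
bucketCounts.show()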

The following would be using the Dataset API for Scala:

import spark.implicits._   // needed for the 'value (Symbol-to-Column) syntax
df.select(('value divide 10).cast("int") * 10)
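
To complete the aggregation with the Dataset API, a short sketch along the same lines, assuming the same import is in scope and that a row count per bucket is wanted (the column name "bucket" is illustrative):

val counts = df
  .select((('value divide 10).cast("int") * 10).as("bucket"))  // lower edge per row
  .groupBy("bucket")
  .count()                                                     // rows per bucket
  .orderBy("bucket")

counts.show()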
