Apache Spark SQL group data by range
Problem description
- I have a table with a column "age". I want to group people into age ranges such as [0,5), [5,10), [10,15), ...
- Then I will run the same calculation on each group and compare the results.
- The goal is to see whether age is correlated with the other variables.
- Please help.
Accepted answer
Demo:
Sample DF:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.apache.spark.ml.feature.Bucketizer
scala> val df = spark.range(20).withColumn("age", round(rand()*90).cast(IntegerType))
df: org.apache.spark.sql.DataFrame = [id: bigint, age: int]
scala> df.show
+---+---+
| id|age|
+---+---+
| 0| 58|
| 1| 57|
| 2| 43|
| 3| 62|
| 4| 18|
| 5| 70|
| 6| 26|
| 7| 54|
| 8| 70|
| 9| 42|
| 10| 38|
| 11| 79|
| 12| 77|
| 13| 14|
| 14| 87|
| 15| 28|
| 16| 15|
| 17| 59|
| 18| 81|
| 19| 25|
+---+---+
Solution:
scala> :paste
// Entering paste mode (ctrl-D to finish)
val splits = Range.Double(0,120,5).toArray
val bucketizer = new Bucketizer()
.setInputCol("age")
.setOutputCol("age_range_id")
.setSplits(splits)
val df2 = bucketizer.transform(df)
// Exiting paste mode, now interpreting.
splits: Array[Double] = Array(0.0, 5.0, 10.0, 15.0, 20.0, 25.0, 30.0, 35.0, 40.0, 45.0, 50.0, 55.0, 60.0, 65.0, 70.0, 75.0, 80.0, 85.0, 90.0, 95.0, 100.0, 105.0, 110.0, 115.0)
bucketizer: org.apache.spark.ml.feature.Bucketizer = bucketizer_3c2040bf50c7
df2: org.apache.spark.sql.DataFrame = [id: bigint, age: int ... 1 more field]
scala> df2.groupBy("age_range_id").count().show
+------------+-----+
|age_range_id|count|
+------------+-----+
| 8.0| 2|
| 7.0| 1|
| 11.0| 3|
| 14.0| 2|
| 3.0| 2|
| 2.0| 1|
| 17.0| 1|
| 10.0| 1|
| 5.0| 3|
| 15.0| 2|
| 16.0| 1|
| 12.0| 1|
+------------+-----+
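The bucket ids above are just the indexes of the intervals in `splits`, so `8.0` means the ninth bucket, `[40, 45)`. To make the output readable you can map an index back to an "[lo, hi)" label; a minimal sketch (the `bucketLabel` helper and the width parameter are illustrative, not part of the Bucketizer API):

```scala
// Map a Bucketizer bucket index back to a human-readable "[lo, hi)" label.
// The bucket width (5) matches the splits used above; adjust it if you
// change the splits.
def bucketLabel(bucketId: Double, width: Int = 5): String = {
  val lo = bucketId.toInt * width
  s"[$lo, ${lo + width})"
}

// Bucket 8.0 covers ages 40 to 44, so bucketLabel(8.0) returns "[40, 45)".
println(bucketLabel(8.0))
println(bucketLabel(0.0))
```

To attach the label as a column you could wrap it in a UDF, e.g. `udf((id: Double) => bucketLabel(id))`, and apply it to `age_range_id`.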
Alternatively you can use the Spark SQL API:
df.createOrReplaceTempView("tab")
val query = """
with t as (select int(age/5) as age_id from tab)
select age_id, count(*) as count
from t
group by age_id
"""
spark.sql(query).show
Result:
scala> spark.sql(query).show
+------+-----+
|age_id|count|
+------+-----+
| 12| 1|
| 16| 1|
| 3| 2|
| 5| 3|
| 15| 2|
| 17| 1|
| 8| 2|
| 7| 1|
| 10| 1|
| 11| 3|
| 14| 2|
| 2| 1|
+------+-----+
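The original goal was to run the same calculation on each group and compare the results. Assuming the table also had a second numeric column (call it `score`; it is hypothetical, not in the sample data), the same `int(age/5)` grouping would let you aggregate it per bucket, e.g. `select int(age/5) as age_id, avg(score) from tab group by age_id`. A plain-Scala mirror of that grouping logic, on made-up (age, score) pairs:

```scala
// Hypothetical (age, score) pairs; the scores are made up for illustration.
val people = Seq((58, 3.1), (57, 2.9), (43, 4.0), (62, 1.5), (18, 3.7))

// Same grouping as int(age/5) in the SQL query: integer division by the
// bucket width, then the per-group calculation (here, an average).
val avgByGroup: Map[Int, Double] =
  people
    .groupBy { case (age, _) => age / 5 }
    .map { case (ageId, rows) => ageId -> rows.map(_._2).sum / rows.size }

// Ages 58 and 57 both fall in group 11, so that group's average is
// (3.1 + 2.9) / 2 = 3.0.
println(avgByGroup)
```

Comparing the per-group results (or plotting them against `age_id`) is then a way to eyeball whether age correlates with the other variable.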