Spark DataSet filter performance

Question

I have been experimenting with different ways to filter a typed data set. It turns out the performance can be quite different.

The data set was created from 1.6 GB of csv data with 33 columns and 4,226,047 rows. The DataSet is created by loading the csv data and mapping it to a case class.

val df = spark.read.csv(csvFile).as[FireIncident]
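
For context, a minimal sketch of how such a data set could be loaded end to end (the FireIncident definition, file name, and read options below are assumptions for illustration, not from the question):

import org.apache.spark.sql.SparkSession

// Hypothetical stand-in for the question's 33-column case class;
// only the field used by the filter is spelled out.
case class FireIncident(UnitID: Option[String] /* , ...32 more fields */)

val spark = SparkSession.builder().appName("filter-perf").master("local[*]").getOrCreate()
import spark.implicits._

// Assumes the csv has a header row whose column names match the case class fields.
val df = spark.read.option("header", "true").csv("fire_incidents.csv").as[FireIncident]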

A filter on UnitId = 'B02' should return 47980 rows. I tested the three approaches below: 1) Use a typed column (~500 ms on local host)

df.where($"UnitID" === "B02").count()

2) Use a temp table and a sql query (~ same as option 1)

df.createOrReplaceTempView("FireIncidentsSF")
spark.sql("SELECT * FROM FireIncidentsSF WHERE UnitID='B02'").count()

3) Use a strongly typed class field (14,987 ms, i.e. 30 times as slow)

df.filter(_.UnitID.orNull == "B02").count()
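
Comparing the physical plans of options 1 and 3 makes the difference visible (a diagnostic sketch; the comments describe typical plan shapes, not verbatim output):

df.where($"UnitID" === "B02").explain()
// The predicate is a Catalyst expression on one column, applied right on the scan.

df.filter(_.UnitID.orNull == "B02").explain()
// The predicate is an opaque Scala function, so the plan must first build a
// full FireIncident object per row (all 33 fields) before calling the lambda.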

I tested it again with the Python API; for the same data set, the timing is 17,046 ms, comparable to the performance of option 3 in the Scala API.

df.filter(df['UnitID'] == 'B02').count()

Could someone shed some light on how 3) and the Python API are executed differently from the first two options?

Answer

It's because of step 3. In the first two, Spark doesn't need to deserialize the whole Java/Scala object - it just looks at the one column and moves on.

In the third, since you're using a lambda function, Spark can't tell that you just want the one field, so it pulls all 33 fields out of memory for each row so that you can check the one field.
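
A common workaround (my addition, not from the original answer) is to express the predicate as a column, so only UnitID is read, while keeping the result a typed Dataset[FireIncident] for any later typed operations:

// Column-based predicate on a typed Dataset: no full-object deserialization,
// and the result is still statically typed as Dataset[FireIncident].
val fast = df.filter($"UnitID" === "B02")
fast.count()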

I'm not sure why the fourth (the Python API version) is so slow. It seems like it would work the same way as the first.
