Spark DataSet filter performance


Question

I have been experimenting with different ways to filter a typed Dataset, and it turns out the performance can be quite different.

The Dataset was created from 1.6 GB of data with 33 columns and 4,226,047 rows, by loading the CSV and mapping it to a case class.

val df = spark.read.csv(csvFile).as[FireIncident]
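
For context, here is a minimal, hypothetical sketch of the setup. The real FireIncident case class is not shown in the question (it has 33 fields); UnitID is modeled as Option[String] only because the question later calls .orNull on it, and the file path and header option are assumptions about how the CSV was read:

import org.apache.spark.sql.SparkSession

// Hypothetical stand-in for the real 33-field case class.
case class FireIncident(IncidentNumber: Option[String], UnitID: Option[String])

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val csvFile = "Fire_Incidents.csv"  // assumed path

val df = spark.read
  .option("header", "true")  // assumed: the CSV has a header row
  .csv(csvFile)
  .as[FireIncident]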

A filter on UnitId = 'B02' should return 47,980 rows. I tested the three approaches below: 1) Use a typed column (~500 ms on localhost)

df.where($"UnitID" === "B02").count()

2) Use a temp table and a SQL query (~ same as option 1)

df.createOrReplaceTempView("FireIncidentsSF")
spark.sql("SELECT * FROM FireIncidentsSF WHERE UnitID='B02'").count()

3) Use a strongly typed class field (14,987 ms, i.e. 30 times as slow)

df.filter(_.UnitID.orNull == "B02").count()

I tested again with the Python API on the same data set; the timing was 17,046 ms, comparable to the performance of Scala API option 3.

df.filter(df['UnitID'] == 'B02').count()

Could someone shed some light on why option 3 and the Python API are executed differently from the first two options?

Answer

It's because of step 3 here.

In the first two, Spark doesn't need to deserialize the whole Java/Scala object - it just looks at the one column and moves on.

In the third, since you're using a lambda function, Spark can't tell that you only want the one field, so it pulls all 33 fields out of memory for each row just so you can check that one field.
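
One way to see this difference (a sketch, assuming the df from the question) is to compare the physical plans of the two filters:

// Column expression: the predicate stays a Catalyst Filter on one column,
// which can be pruned and pushed toward the scan.
df.where($"UnitID" === "B02").explain()

// Typed lambda: the plan contains a DeserializeToObject step followed by a
// filter on an opaque function, so the full object is built for every row
// before the predicate runs.
df.filter(_.UnitID.orNull == "B02").explain()

Note that where(Column) on a Dataset still returns a Dataset[FireIncident], so you keep the typed API downstream while paying only the column-expression cost for the predicate.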

I'm not sure why the fourth test (the Python API) is so slow. It seems like it should work the same way as the first.
