Spark DataSet filter performance


Question

I have been experimenting with different ways to filter a typed Dataset, and it turns out the performance can be quite different.

The Dataset was created from 1.6 GB of data with 33 columns and 4,226,047 rows, by loading the CSV and mapping it to a case class.

val df = spark.read.csv(csvFile).as[FireIncident]
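
For context, here is a minimal, hypothetical sketch of the setup. The real FireIncident case class is not shown in the question (it has 33 fields); UnitID is modeled as Option[String] only because the question later calls .orNull on it, and the file path and header option are assumptions about how the CSV was read:

import org.apache.spark.sql.SparkSession

// Hypothetical stand-in for the real 33-field case class.
case class FireIncident(IncidentNumber: Option[String], UnitID: Option[String])

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val csvFile = "Fire_Incidents.csv"  // assumed path

val df = spark.read
  .option("header", "true")  // assumed: the CSV has a header row
  .csv(csvFile)
  .as[FireIncident]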

A filter on UnitId = 'B02' should return 47,980 rows. I tested the three approaches below: 1) Use a typed column (~500 ms on localhost)

df.where($"UnitID" === "B02").count()

2) Use a temp table and a SQL query (~ same as option 1)

df.createOrReplaceTempView("FireIncidentsSF")
spark.sql("SELECT * FROM FireIncidentsSF WHERE UnitID='B02'").count()

3) Use a strongly typed class field (14,987 ms, i.e. 30 times as slow)

df.filter(_.UnitID.orNull == "B02").count()

I tested again with the Python API on the same data set; the timing was 17,046 ms, comparable to the performance of Scala API option 3.

df.filter(df['UnitID'] == 'B02').count()

Could someone shed some light on why option 3 and the Python API are executed differently from the first two options?

Answer

It's because of step 3 here.

In the first two, Spark doesn't need to deserialize the whole Java/Scala object - it just looks at the one column and moves on.

In the third, since you're using a lambda function, Spark can't tell that you only want the one field, so it pulls all 33 fields out of memory for each row just so you can check that one field.
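
One way to see this difference (a sketch, assuming the df from the question) is to compare the physical plans of the two filters:

// Column expression: the predicate stays a Catalyst Filter on one column,
// which can be pruned and pushed toward the scan.
df.where($"UnitID" === "B02").explain()

// Typed lambda: the plan contains a DeserializeToObject step followed by a
// filter on an opaque function, so the full object is built for every row
// before the predicate runs.
df.filter(_.UnitID.orNull == "B02").explain()

Note that where(Column) on a Dataset still returns a Dataset[FireIncident], so you keep the typed API downstream while paying only the column-expression cost for the predicate.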

I'm not sure why the fourth test (the Python API) is so slow. It seems like it should work the same way as the first.
