Spark DataSet filter performance

Question

I have been experimenting with different ways to filter a typed data set. It turns out the performance can be quite different.

The data set was created from 1.6 GB of csv data with 33 columns and 4,226,047 rows. The DataSet is created by loading the csv data and mapping it to a case class.

val df = spark.read.csv(csvFile).as[FireIncident]
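
For context, a minimal sketch of how such a data set could be loaded end to end (the FireIncident definition, file name, and read options below are assumptions for illustration, not from the question):

import org.apache.spark.sql.SparkSession

// Hypothetical stand-in for the question's 33-column case class;
// only the field used by the filter is spelled out.
case class FireIncident(UnitID: Option[String] /* , ...32 more fields */)

val spark = SparkSession.builder().appName("filter-perf").master("local[*]").getOrCreate()
import spark.implicits._

// Assumes the csv has a header row whose column names match the case class fields.
val df = spark.read.option("header", "true").csv("fire_incidents.csv").as[FireIncident]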

A filter on UnitId = 'B02' should return 47980 rows. I tested the three approaches below: 1) Use a typed column (~500 ms on local host)

df.where($"UnitID" === "B02").count()

2) Use a temp table and a sql query (~ same as option 1)

df.createOrReplaceTempView("FireIncidentsSF")
spark.sql("SELECT * FROM FireIncidentsSF WHERE UnitID='B02'").count()

3) Use a strongly typed class field (14,987 ms, i.e. 30 times as slow)

df.filter(_.UnitID.orNull == "B02").count()
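
Comparing the physical plans of options 1 and 3 makes the difference visible (a diagnostic sketch; the comments describe typical plan shapes, not verbatim output):

df.where($"UnitID" === "B02").explain()
// The predicate is a Catalyst expression on one column, applied right on the scan.

df.filter(_.UnitID.orNull == "B02").explain()
// The predicate is an opaque Scala function, so the plan must first build a
// full FireIncident object per row (all 33 fields) before calling the lambda.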

I tested it again with the Python API; for the same data set, the timing is 17,046 ms, comparable to the performance of option 3 in the Scala API.

df.filter(df['UnitID'] == 'B02').count()

Could someone shed some light on how 3) and the Python API are executed differently from the first two options?

Answer

It's because of step 3. In the first two, Spark doesn't need to deserialize the whole Java/Scala object - it just looks at the one column and moves on.

In the third, since you're using a lambda function, Spark can't tell that you just want the one field, so it pulls all 33 fields out of memory for each row so that you can check the one field.
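
A common workaround (my addition, not from the original answer) is to express the predicate as a column, so only UnitID is read, while keeping the result a typed Dataset[FireIncident] for any later typed operations:

// Column-based predicate on a typed Dataset: no full-object deserialization,
// and the result is still statically typed as Dataset[FireIncident].
val fast = df.filter($"UnitID" === "B02")
fast.count()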

I'm not sure why the fourth (the Python API version) is so slow. It seems like it would work the same way as the first.
