What does load() do in Spark?

Question

Spark is lazy, right? So what does load() do?

import timeit

start = timeit.default_timer()
df = sqlContext.read.option(
    "es.resource", indexes
).format("org.elasticsearch.spark.sql")   # this is still a DataFrameReader, not a DataFrame
end = timeit.default_timer()
print('without load: ', end - start)      # almost instant

start = timeit.default_timer()
df = df.load()                            # now a DataFrame
end = timeit.default_timer()
print('load: ', end - start)              # takes 1 sec

start = timeit.default_timer()
df.show()
end = timeit.default_timer()
print('show: ', end - start)              # takes 4 sec

If show() were the only action, I would have guessed that load() wouldn't take as much as 1 second. So I'm concluding that load() is an action (as opposed to a transformation in Spark).

Does load() actually load the whole dataset into memory? I don't think so, but then what does it do?

I've searched and looked at the docs (https://spark.apache.org/docs/latest/sql-data-sources-load-save-functions.html), but they don't help.

Answer

tl;dr: load() is a DataFrameReader API (org.apache.spark.sql.DataFrameReader#load) that, as the code below shows, returns a DataFrame, on top of which Spark transformations can be applied.

/**
   * Loads input in as a `DataFrame`, for data sources that support multiple paths.
   * Only works if the source is a HadoopFsRelationProvider.
   *
   * @since 1.6.0
   */
  @scala.annotation.varargs
  def load(paths: String*): DataFrame
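
A quick way to see this from PySpark (a minimal sketch: spark is assumed to be an existing SparkSession, and the path and column name are made up for the example):

df = spark.read.format("csv").option("header", "true").load("/tmp/sales.csv")
df.printSchema()               # schema is already resolved: that work is the eager part of load()
result = df.select("amount")   # transformation: lazy, no job runs yet
result.show()                  # action: this is what actually scans the data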

One needs to create a DataFrame to perform a transformation. To create a DataFrame from a path (HDFS, S3, etc.), users can use spark.read.format("<format>").load(). There are datasource-specific APIs as well that load files directly, such as spark.read.parquet(<path>). The two forms are equivalent, as sketched below.
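
For instance, these two lines produce the same DataFrame (a sketch with a hypothetical path):

df1 = spark.read.format("parquet").load("/data/events")   # generic form
df2 = spark.read.parquet("/data/events")                  # format-specific shortcut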

In file-based sources, this time can be attributed to listing files. In HDFS this listing is cheap, whereas in cloud storage like S3 it is very expensive, taking time proportional to the number of files.
In your case the datasource used is elasticsearch, so the time can be attributed to connection establishment, collecting metadata to plan a distributed scan, and so on, depending on the Elasticsearch connector implementation. We can enable debug logs and check for more information. If Elasticsearch has a way to log the requests it receives, we could check its logs for the requests issued at the time load() was fired.
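
One way to surface more detail from the driver side (a sketch only: this raises Spark's own log4j level, says nothing about server-side Elasticsearch logs, and the index name is made up):

spark.sparkContext.setLogLevel("DEBUG")   # very verbose; connector internals may log here
df = (spark.read
      .format("org.elasticsearch.spark.sql")
      .option("es.resource", "my-index")  # hypothetical index
      .load())
spark.sparkContext.setLogLevel("WARN")    # restore a quieter level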
