Speed up InMemoryFileIndex for Spark SQL job with large number of input files

Question

I have an Apache Spark SQL job (using Datasets), coded in Java, that gets its input from between 70,000 and 150,000 files.

It appears to take anywhere from 45 minutes to 1.5 hours to build the InMemoryFileIndex.

There are no logs, very low network usage, and almost no CPU usage during this time.

Here's a sample of what I see in the std output:

24698 [main] INFO org.spark_project.jetty.server.handler.ContextHandler  - Started o.s.j.s.ServletContextHandler@32ec9c90{/static/sql,null,AVAILABLE,@Spark}
25467 [main] INFO org.apache.spark.sql.execution.streaming.state.StateStoreCoordinatorRef  - Registered StateStoreCoordinator endpoint
2922000 [main] INFO org.apache.spark.sql.execution.datasources.InMemoryFileIndex  - Listing leaf files and directories in parallel under: <a LOT of file url's...>
2922435 [main] INFO org.apache.spark.SparkContext  - Starting job: textFile at SomeClass.java:103

In this case, there was 45 minutes of essentially nothing happening (as far as I could tell).

I load the files using:

sparkSession.read().textFile(pathsArray)

Can someone explain what is going on in InMemoryFileIndex, and how I can make this step faster?

Answer

The InMemoryFileIndex is responsible for partition discovery (and consequently partition pruning). It performs the file listing and may run a parallel job, which can take some time if you have a lot of files, since it has to index each file. While doing this, Spark collects some basic information about the files (their size, for instance) to compute basic statistics that are then used during query planning. If you want to avoid this each time you read the data in, you can save the data as a datasource table (supported since Spark 2.1) using the metastore and the saveAsTable() command. The partition discovery is then performed only once and the information is kept in the metastore. After that you can read the data through the metastore:

sparkSession.read.table(table_name)

and it should be fast since the partition discovery phase will be skipped. I recommend watching this Spark Summit talk in which this problem is discussed.
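
For reference, here is a minimal Java sketch of that approach. The input path and the table name my_text_data are placeholders, and enableHiveSupport() assumes a Hive metastore is configured (omit it to use Spark's default local catalog):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class PersistFileIndexExample {
    public static void main(String[] args) {
        // enableHiveSupport() assumes a Hive metastore is available so the
        // table metadata (including file/partition information) is persisted.
        SparkSession spark = SparkSession.builder()
                .appName("PersistFileIndexExample")
                .enableHiveSupport()
                .getOrCreate();

        // One-time (slow) pass: reading the raw text files triggers the
        // InMemoryFileIndex listing; then persist them as a datasource table.
        Dataset<String> lines = spark.read().textFile("/path/to/input");
        lines.write()
             .mode(SaveMode.Overwrite)
             .saveAsTable("my_text_data");   // hypothetical table name

        // Subsequent reads go through the metastore, so the expensive
        // file-listing / partition-discovery step is skipped.
        Dataset<Row> fast = spark.read().table("my_text_data");
        fast.show(5);
    }
}

The save step still pays the listing cost once; the benefit comes on every later read, which resolves the table from the metastore instead of re-scanning the 70,000+ files.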
