Speed up InMemoryFileIndex for Spark SQL job with large number of input files


Question

I have an Apache Spark SQL job (using Datasets), coded in Java, that gets its input from between 70,000 and 150,000 files.

It appears to take anywhere from 45 minutes to 1.5 hours to build the InMemoryFileIndex.

There are no logs, very low network usage, and almost no CPU usage during this time.

Here's a sample of what I see in the standard output:

24698 [main] INFO org.spark_project.jetty.server.handler.ContextHandler  - Started o.s.j.s.ServletContextHandler@32ec9c90{/static/sql,null,AVAILABLE,@Spark}
25467 [main] INFO org.apache.spark.sql.execution.streaming.state.StateStoreCoordinatorRef  - Registered StateStoreCoordinator endpoint
2922000 [main] INFO org.apache.spark.sql.execution.datasources.InMemoryFileIndex  - Listing leaf files and directories in parallel under: <a LOT of file url's...>
2922435 [main] INFO org.apache.spark.SparkContext  - Starting job: textFile at SomeClass.java:103

In this case, there was 45 minutes of essentially nothing happening (as far as I could tell).

I load the files using:

sparkSession.read().textFile(pathsArray)

Can someone explain what is going on in InMemoryFileIndex, and how can I make this step faster?

Answer

The InMemoryFileIndex is responsible for partition discovery (and consequently partition pruning). It performs the file listing and may run a parallel job, which can take some time if you have a lot of files, since it has to index each file. While doing this, Spark collects some basic information about the files (their size, for instance) to compute basic statistics that are then used during query planning. If you want to avoid this every time you read the data, you can save the data as a datasource table (supported from Spark 2.1) using the metastore and the saveAsTable() command; the partition discovery will then be performed only once and the information will be kept in the metastore. You can then read the data via the metastore:

sparkSession.read().table(table_name)

and it should be fast, since this partition discovery phase will be skipped. I recommend watching this Spark Summit talk, in which this problem is discussed.
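
For illustration, here is a minimal Java sketch of the approach described above. The table name, input path, and the enableHiveSupport() call are assumptions for the example, not part of the original answer; adapt them to your environment.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class MetastoreTableExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("metastore-table-example")
                .enableHiveSupport()   // assumes a Hive metastore is available to persist table metadata
                .getOrCreate();

        // One-time (slow) pass: read the raw text files and persist them as a
        // datasource table. The expensive file listing / partition discovery
        // happens here, and the resulting metadata is recorded in the metastore.
        Dataset<String> raw = spark.read().textFile("hdfs:///path/to/input");   // hypothetical path
        raw.write()
           .mode(SaveMode.Overwrite)
           .saveAsTable("my_input_table");   // hypothetical table name

        // Subsequent (fast) reads: file and partition metadata comes from the
        // metastore, so the long InMemoryFileIndex listing step is skipped.
        Dataset<Row> data = spark.read().table("my_input_table");
        data.show(10);
    }
}

The write only needs to run when the set of input files changes; every other job can read straight from the table and avoid re-listing the 70,000+ files.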
