Spark甚至列出分区数据中的所有叶节点 [英] Spark lists all leaf node even in partitioned data

查看：187 发布时间：2020/8/23 4:15:00 apache-spark amazon-s3 apache-spark-sql partitioning parquet

本文介绍了Spark甚至列出分区数据中的所有叶节点的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我有按date& hour，文件夹结构:

I have parquet data partitioned by date & hour, folder structure:

events_v3
  -- event_date=2015-01-01
    -- event_hour=2015-01-1
      -- part10000.parquet.gz
  -- event_date=2015-01-02
    -- event_hour=5
      -- part10000.parquet.gz

我已经通过spark创建了一个表raw_events，但是当我尝试查询时，它会扫描所有目录中的页脚，这会减慢初始查询的速度，即使我只查询一天的数据也是如此.

I have created a table raw_events via spark but when I try to query, it scans all the directories for footer and that slows down the initial query, even if I am querying only one day worth of data.

查询: select * from raw_events where event_date='2016-01-01'

类似问题: http://mail-archives.apache.org/mod_mbox/spark-user/201508.mbox/%3CCAAswR-7Qbd2tdLSsO76zyw9tvs-Njw2YVd36bRfCG3DKZrH0tw@mail.gmail.com%3E

日志:

App > 16/09/15 03:14:03 main INFO HadoopFsRelation: Listing leaf files and directories in parallel under: s3a://bucket/events_v3/

然后它会产生350个任务，因为有350天的数据价值.

and then it spawns 350 tasks since there are 350 days worth of data.

我禁用了schemaMerge，并且还指定了要读取为的架构，因此它可以转到我正在查看的分区，为什么它应该打印所有叶文件? 用2个执行程序列出叶子文件需要10分钟，而查询的实际执行需要20秒

I have disabled schemaMerge, and have also specified the schema to read as, so it can just go to the partition that I am looking at, why should it print all the leaf files ? Listing leaf files with 2 executors take 10 minutes, and the query actual execution takes on 20 seconds

代码示例:

val sparkSession = org.apache.spark.sql.SparkSession.builder.getOrCreate()
val df = sparkSession.read.option("mergeSchema","false").format("parquet").load("s3a://bucket/events_v3")
    df.createOrReplaceTempView("temp_events")
    sparkSession.sql(
      """
        |select verb,count(*) from temp_events where event_date = "2016-01-01" group by verb
      """.stripMargin).show()

推荐答案

一旦给spark提供了一个要从中读取的目录，就会发出对listLeafFiles(org/apache/spark/sql/execution/datasources/fileSourceInterfaces的调用).斯卡拉(Scala).反过来，这会调用fs.listStatus，后者会进行api调用以获取文件和目录的列表.现在，对于每个目录，再次调用此方法.这会递归发生，直到没有目录为止.从设计上讲，这在HDFS系统中效果很好.但是在s3中效果不佳，因为列表文件是RPC调用. S3在其他方面支持通过前缀获取所有文件，这正是我们所需要的.

As soon as spark is given a directory to read from it issues call to listLeafFiles (org/apache/spark/sql/execution/datasources/fileSourceInterfaces.scala). This in turn calls fs.listStatus which makes an api call to get list of files and directories. Now for each directory this method is called again. This hapens recursively until no directories are left. This by design works good in a HDFS system. But works bad in s3 since list file is an RPC call. S3 on other had supports get all files by prefix, which is exactly what we need.

因此，例如，如果我们具有上面的目录结构，并且该目录结构具有价值1年的数据，并且每个目录小时和10个子目录，我们将拥有365 * 24 * 10 = 87k个api调用，则可以将其减少为138个api调用只有137000个文件.每个s3 api调用都会返回1000个文件.

So for example if we had above directory structure with 1 year worth of data with each directory for hour and 10 sub directory we would have , 365 * 24 * 10 = 87k api calls, this can be reduced to 138 api calls given that there are only 137000 files. Each s3 api calls return 1000 files.

代码: org/apache/hadoop/fs/s3a/S3AFileSystem.java

public FileStatus[] listStatusRecursively(Path f) throws FileNotFoundException,
            IOException {
        String key = pathToKey(f);
        if (LOG.isDebugEnabled()) {
            LOG.debug("List status for path: " + f);
        }

        final List<FileStatus> result = new ArrayList<FileStatus>();
        final FileStatus fileStatus =  getFileStatus(f);

        if (fileStatus.isDirectory()) {
            if (!key.isEmpty()) {
                key = key + "/";
            }

            ListObjectsRequest request = new ListObjectsRequest();
            request.setBucketName(bucket);
            request.setPrefix(key);
            request.setMaxKeys(maxKeys);

            if (LOG.isDebugEnabled()) {
                LOG.debug("listStatus: doing listObjects for directory " + key);
            }

            ObjectListing objects = s3.listObjects(request);
            statistics.incrementReadOps(1);

            while (true) {
                for (S3ObjectSummary summary : objects.getObjectSummaries()) {
                    Path keyPath = keyToPath(summary.getKey()).makeQualified(uri, workingDir);
                    // Skip over keys that are ourselves and old S3N _$folder$ files
                    if (keyPath.equals(f) || summary.getKey().endsWith(S3N_FOLDER_SUFFIX)) {
                        if (LOG.isDebugEnabled()) {
                            LOG.debug("Ignoring: " + keyPath);
                        }
                        continue;
                    }

                    if (objectRepresentsDirectory(summary.getKey(), summary.getSize())) {
                        result.add(new S3AFileStatus(true, true, keyPath));
                        if (LOG.isDebugEnabled()) {
                            LOG.debug("Adding: fd: " + keyPath);
                        }
                    } else {
                        result.add(new S3AFileStatus(summary.getSize(),
                                dateToLong(summary.getLastModified()), keyPath,
                                getDefaultBlockSize(f.makeQualified(uri, workingDir))));
                        if (LOG.isDebugEnabled()) {
                            LOG.debug("Adding: fi: " + keyPath);
                        }
                    }
                }

                for (String prefix : objects.getCommonPrefixes()) {
                    Path keyPath = keyToPath(prefix).makeQualified(uri, workingDir);
                    if (keyPath.equals(f)) {
                        continue;
                    }
                    result.add(new S3AFileStatus(true, false, keyPath));
                    if (LOG.isDebugEnabled()) {
                        LOG.debug("Adding: rd: " + keyPath);
                    }
                }

                if (objects.isTruncated()) {
                    if (LOG.isDebugEnabled()) {
                        LOG.debug("listStatus: list truncated - getting next batch");
                    }

                    objects = s3.listNextBatchOfObjects(objects);
                    statistics.incrementReadOps(1);
                } else {
                    break;
                }
            }
        } else {
            if (LOG.isDebugEnabled()) {
                LOG.debug("Adding: rd (not a dir): " + f);
            }
            result.add(fileStatus);
        }

        return result.toArray(new FileStatus[result.size()]);
    }

/org/apache/spark/sql/execution/datasources/fileSourceInterfaces.scala

def listLeafFiles(fs: FileSystem, status: FileStatus, filter: PathFilter): Array[FileStatus] = {
    logTrace(s"Listing ${status.getPath}")
    val name = status.getPath.getName.toLowerCase
    if (shouldFilterOut(name)) {
      Array.empty[FileStatus]
    }
    else {
      val statuses = {
        val stats = if(fs.isInstanceOf[S3AFileSystem]){
          logWarning("Using Monkey patched version of list status")
          println("Using Monkey patched version of list status")
          val a = fs.asInstanceOf[S3AFileSystem].listStatusRecursively(status.getPath)
          a
//          Array.empty[FileStatus]
        }
        else{
          val (dirs, files) = fs.listStatus(status.getPath).partition(_.isDirectory)
          files ++ dirs.flatMap(dir => listLeafFiles(fs, dir, filter))

        }
        if (filter != null) stats.filter(f => filter.accept(f.getPath)) else stats
      }
      // statuses do not have any dirs.
      statuses.filterNot(status => shouldFilterOut(status.getPath.getName)).map {
        case f: LocatedFileStatus => f

        // NOTE:
        //
        // - Although S3/S3A/S3N file system can be quite slow for remote file metadata
        //   operations, calling `getFileBlockLocations` does no harm here since these file system
        //   implementations don't actually issue RPC for this method.
        //
        // - Here we are calling `getFileBlockLocations` in a sequential manner, but it should not
        //   be a big deal since we always use to `listLeafFilesInParallel` when the number of
        //   paths exceeds threshold.
        case f => createLocatedFileStatus(f, fs.getFileBlockLocations(f, 0, f.getLen))
      }
    }
  }

这篇关于Spark甚至列出分区数据中的所有叶节点的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持IT屋！

查看全文

Spark甚至列出分区数据中的所有叶节点 [英] Spark lists all leaf node even in partitioned data

问题描述

推荐答案

相关文章

其他开发最新文章

热门教程

热门工具

登录关闭

Spark甚至列出分区数据中的所有叶节点 [英] Spark lists all leaf node even in partitioned data

问题描述

推荐答案

相关文章

其他开发最新文章

热门教程

热门工具

登录 关闭

登录关闭