Bypass first line of each file in Spark (Scala)

Views: 183 Posted: 2015/12/1 10:20:11 Tags: scala amazon-s3 apache-spark

Question

I am processing an S3 folder of csv.gz files in Spark. Each csv.gz file starts with a header line that contains the column names.

I load the data into Spark by referencing the path (folder), like this:

val rdd = sc.textFile("s3://.../my-s3-path")

How can I skip the header in each file, so that I process only the values?

Thanks

Recommended answer

You can do it like this:

val rdd = sc.textFile("s3://.../my-s3-path").mapPartitions(_.drop(1))

Because each input file is gzipped, and gzip is not a splittable format, each file is loaded as its own single partition. Mapping over the partitions and dropping the first line of each therefore removes the first line, the header, from every file. This only holds while the one-file-per-partition layout is intact, so the drop has to happen before any repartitioning.
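
To make the reasoning concrete, here is a minimal sketch in plain Scala, without a Spark dependency, that simulates the behavior. It assumes each inner `Seq` stands in for one partition (that is, one gzipped file); the object name `DropHeaderSketch`, the helper `dropHeader`, and the sample rows are hypothetical and not part of the original answer.

```scala
// Plain-Scala simulation of the per-partition header drop (no Spark needed).
// Assumption: one partition corresponds to exactly one gzipped input file.
object DropHeaderSketch {
  // Per-partition function with the Iterator => Iterator shape that
  // Spark's mapPartitions expects.
  def dropHeader(lines: Iterator[String]): Iterator[String] = lines.drop(1)

  // Hypothetical sample data: two "files", each starting with a header line.
  val partitions: Seq[Seq[String]] = Seq(
    Seq("id,name", "1,alice", "2,bob"),
    Seq("id,name", "3,carol")
  )

  // Apply the function to every partition and concatenate the remaining rows.
  val values: Seq[String] =
    partitions.flatMap(p => dropHeader(p.iterator).toList)

  def main(args: Array[String]): Unit = values.foreach(println)
}
```

Running `DropHeaderSketch.main` prints only the three data rows. In Spark the same function could be passed as `rdd.mapPartitions(DropHeaderSketch.dropHeader)`. If the inputs were uncompressed and large enough to be split across several partitions, the same call would also drop the first line of every additional split, which is a data row rather than a header.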
