How do I skip a header from CSV files in Spark?

Question

Suppose I give three file paths to a Spark context to read, and each file has a schema in its first row. How can we skip the schema lines from the headers?

val rdd = sc.textFile("file1,file2,file3")

Now, how can we skip header lines from this rdd?

Answer

If there were just one header line in the first record, then the most efficient way to filter it out would be:

rdd.mapPartitionsWithIndex { (idx, iter) =>
  // Drop one line only in the first partition, where the header lives.
  if (idx == 0) iter.drop(1) else iter
}

Of course, this doesn't help if there are many files, each with header lines inside. You can, however, union three RDDs made this way.
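The per-file approach can be sketched in plain Python. The file contents below are hypothetical stand-ins for what `sc.textFile` would yield per file; the idea is simply to drop the first line of each file, then union the results:

```python
# Hypothetical contents of three CSV files, each with its own header row.
files = {
    "file1": ["id,name", "1,alice"],
    "file2": ["id,name", "2,bob"],
    "file3": ["id,name", "3,carol"],
}

# Drop the first line (the header) of each file, then union the remainders,
# mirroring one RDD per file followed by a union in Spark.
union = []
for lines in files.values():
    union.extend(lines[1:])

print(union)  # ['1,alice', '2,bob', '3,carol']
```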

You could also just write a filter that matches only a line that could be a header. This is quite simple, but less efficient.
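The filter variant can be sketched in plain Python, under the assumption that every header line is byte-identical (in Spark this would be something like `rdd.filter(line => line != header)`). It is less efficient because every line is compared against the header:

```python
# Lines from several unioned files; header rows appear mid-stream.
lines = ["id,name", "1,alice", "2,bob", "id,name", "3,carol"]

header = lines[0]
# Keep every line that does not match the known header.
data = [line for line in lines if line != header]

print(data)  # ['1,alice', '2,bob', '3,carol']
```

Note this silently drops any data row that happens to equal the header string, which is why the partition-index approach is preferred when possible.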

Python equivalent:

from itertools import islice

rdd.mapPartitionsWithIndex(
    lambda idx, it: islice(it, 1, None) if idx == 0 else it 
)
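The same logic can be checked without a Spark cluster by applying the function to mock partitions (plain lists standing in for the iterators Spark would pass in):

```python
from itertools import islice

def drop_first_in_partition(idx, it):
    # Same logic as the lambda above: skip one element only in partition 0.
    return islice(it, 1, None) if idx == 0 else it

# Two mock partitions; the header line lands in partition 0.
partitions = [["id,name", "1,alice"], ["2,bob", "3,carol"]]

rows = [row
        for idx, part in enumerate(partitions)
        for row in drop_first_in_partition(idx, iter(part))]

print(rows)  # ['1,alice', '2,bob', '3,carol']
```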
