How to skip header from csv files in Spark?


Problem description

Suppose I give three file paths for the Spark context to read, and each file has its schema in the first row. How can we skip the schema lines from the headers?

  val rdd = sc.textFile("file1,file2,file3")

Now, how can we skip the header lines from this RDD?

Solution

If there were just one header line, in the first record, then the most efficient way to filter it out would be:

  rdd.mapPartitionsWithIndex { (idx, iter) => if (idx == 0) iter.drop(1) else iter }

This doesn't help, of course, if there are many files with many header lines inside. In that case you can union three RDDs built this way, one per file, as in the sketch below.
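A minimal sketch of that per-file approach, assuming sc is the usual SparkContext and the three path strings are placeholders for the real files:

  // Read each file as its own RDD so that partition index 0 of each RDD
  // starts at that file's first line, drop the header there, then union
  // the results back into a single RDD.
  val files = Seq("file1", "file2", "file3")

  val withoutHeaders = files
    .map { path =>
      sc.textFile(path).mapPartitionsWithIndex { (idx, iter) =>
        if (idx == 0) iter.drop(1) else iter
      }
    }
    .reduce(_ union _)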



You could also just write a filter that matches only a line that could be a header. This is quite simple, but less efficient.
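One simple form of that filter, as a sketch: take the first line of the RDD as the header and drop every line equal to it. This assumes all three files share an identical header row.

  // Grab the header text once, then keep only the lines that differ from it.
  // This scans every record, which is why it is less efficient than the
  // partition-index approach above.
  val header = rdd.first()
  val withoutHeader = rdd.filter(line => line != header)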



