Spark csv to dataframe skip first row


Question

I am loading a CSV into a DataFrame using:

sqlContext.read.format("com.databricks.spark.csv").option("header", "true").
                option("delimiter", ",").load("file.csv")

But my input file contains a date in the first row and the header in the second row. Example:

20160612
id,name,age
1,abc,12
2,bcd,33

How can I skip this first row while converting the CSV to a DataFrame?

Recommended answer

Here are several options I can think of, since the Databricks spark-csv module doesn't seem to provide a skip-lines option:

Option one: add a "#" character in front of the first line; that line will then automatically be treated as a comment and ignored by the spark-csv module.
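To illustrate why option one works, here is a pure-Scala sketch (no Spark required) of the comment filtering that a comment-aware CSV reader performs; the exact option name and default comment character in spark-csv are assumptions worth checking against its README:

```scala
// Minimal sketch, no Spark needed: once the first line starts with "#",
// a comment-aware reader simply drops it before parsing the rest.
val raw  = Seq("#20160612", "id,name,age", "1,abc,12", "2,bcd,33")
val kept = raw.filterNot(_.startsWith("#"))
// kept now begins with the header line "id,name,age"
```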

Option two: create your own custom schema and set the mode option to DROPMALFORMED; this drops the first line because it contains fewer tokens than the customSchema expects:

import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType};

val customSchema = StructType(Array(StructField("id", IntegerType, true), 
                                    StructField("name", StringType, true),
                                    StructField("age", IntegerType, true)))

val df = sqlContext.read.format("com.databricks.spark.csv").
                         option("header", "true").
                         option("mode", "DROPMALFORMED").
                         schema(customSchema).load("test.txt")

df.show

16/06/12 21:24:05 WARN CsvRelation$: Number format exception. Dropping malformed line: id,name,age

+---+----+---+
| id|name|age|
+---+----+---+
|  1| abc| 12|
|  2| bcd| 33|
+---+----+---+

Note the warning message above, which says the malformed line was dropped.

Option three: write your own parser to drop lines that don't have a length of three:

val file = sc.textFile("pathToYourCsvFile")

// needed for the toDF conversion on an RDD of tuples
// (already in scope in spark-shell)
import sqlContext.implicits._

val df = file.map(line => line.split(",")).
              filter(cols => cols.length == 3 && cols(0) != "id").
              map(row => (row(0), row(1), row(2))).
              toDF("id", "name", "age")

df.show
+---+----+---+
| id|name|age|
+---+----+---+
|  1| abc| 12|
|  2| bcd| 33|
+---+----+---+
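To see why this filter handles the sample file, here is the same split-and-filter logic run on plain Scala collections (no Spark): the date line splits into a single token and is dropped by the length check, and the header line is dropped by the `cols(0) != "id"` check:

```scala
// The sample file's lines, as option three's parser sees them.
val lines = Seq("20160612", "id,name,age", "1,abc,12", "2,bcd,33")

val rows = lines.map(_.split(",")).
                 filter(cols => cols.length == 3 && cols(0) != "id").
                 map(cols => (cols(0), cols(1), cols(2)))
// "20160612" splits into one token (dropped); the header is dropped by name.
```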
