Apache Spark Dataframe - Load data from nth line of a CSV file


Problem description


I would like to process a huge order CSV file (5GB) with some metadata rows at the start of the file. The header columns are in row 4 (starting with "h,"), followed by another metadata row describing optionality. Data rows start with "d,":

m,Version,v1.0
m,Type,xx
m,<OtherMetaData>,<...>
h,Col1,Col2,Col3,Col4,Col5,.............,Col100
m,Mandatory,Optional,Optional,...........,Mandatory
d,Val1,Val2,Val3,Val4,Val5,.............,Val100


Is it possible to skip a specified number of rows when loading the file, and still use the 'inferSchema' option for the Dataset?

Dataset<Row> df = spark.read()
            .format("csv")
            .option("header", "true")
            .option("inferSchema", "true")
            .load("/home/user/data/20170326.csv");
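For intuition, skipping a fixed number of leading lines and treating the next line as the header amounts to the following (a minimal pure-Python sketch over the sample layout above, not Spark code; `read_skipping` is a hypothetical helper name):

```python
import csv
import io

def read_skipping(text, skip):
    """Drop the first `skip` lines, then parse the rest as CSV;
    the first remaining line is treated as the header."""
    lines = text.splitlines()[skip:]
    rows = list(csv.reader(io.StringIO("\n".join(lines))))
    return rows[0], rows[1:]  # (header, data rows)

sample = "\n".join([
    "m,Version,v1.0",
    "m,Type,xx",
    "m,<OtherMetaData>,<...>",
    "h,Col1,Col2,Col3",
    "m,Mandatory,Optional,Optional",
    "d,Val1,Val2,Val3",
])

# Skip the 3 leading metadata rows; row 4 becomes the header.
header, data = read_skipping(sample, 3)
```

Note the limitation: a fixed skip count removes only the *leading* metadata, so the "m,Mandatory,..." row after the header still ends up in the data and would need separate filtering.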


Or do I need to define two different Datasets and use "except(Dataset other)" to exclude the dataset with rows to be ignored?

Answer


You can try setting the "comment" option to "m", effectively telling the CSV reader to skip lines beginning with the "m" character:

Dataset<Row> df = spark.read()
            .format("csv")
            .option("header", "true")
            .option("inferSchema", "true")
            .option("comment", "m")
            .load("/home/user/data/20170326.csv");
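The effect of the "comment" option on the sample file can be illustrated with a plain-Python filter (a sketch of the semantics only, not Spark itself; `read_with_comment_char` is a hypothetical helper): every line whose first character matches the comment character is dropped before parsing, so the "h," line becomes the header and only "d," lines remain as data.

```python
import csv
import io

def read_with_comment_char(text, comment_char):
    """Mimic a CSV reader's comment option: discard lines that
    start with comment_char, treat the first kept line as header."""
    kept = [l for l in text.splitlines() if not l.startswith(comment_char)]
    rows = list(csv.reader(io.StringIO("\n".join(kept))))
    return rows[0], rows[1:]  # (header, data rows)

sample = "\n".join([
    "m,Version,v1.0",
    "m,Type,xx",
    "m,<OtherMetaData>,<...>",
    "h,Col1,Col2,Col3",
    "m,Mandatory,Optional,Optional",
    "d,Val1,Val2,Val3",
])

header, data = read_with_comment_char(sample, "m")
# The optionality row after the header is skipped too,
# since it also begins with "m".
```

This is why the comment-character approach fits this file so well: unlike a fixed skip count, it also removes the "m,Mandatory,..." row that sits *after* the header.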
