Apache Spark Dataframe - Load data from nth line of a CSV file


Question

I would like to process a huge order CSV file (5 GB) with some metadata rows at the start of the file. The header columns are given in row 4 (starting with "h,"), followed by another metadata row describing optionality. Data rows start with "d,".

m,Version,v1.0
m,Type,xx
m,<OtherMetaData>,<...>
h,Col1,Col2,Col3,Col4,Col5,.............,Col100
m,Mandatory,Optional,Optional,...........,Mandatory
d,Val1,Val2,Val3,Val4,Val5,.............,Val100

Is it possible to skip a specified number of rows when loading the file, and still use the 'inferSchema' option on the Dataset?

Dataset<Row> df = spark.read()
            .format("csv")
            .option("header", "true")
            .option("inferSchema", "true")
            .load("/home/user/data/20170326.csv");

Or do I need to define two different Datasets and use except(Dataset other) to exclude the Dataset containing the rows to be ignored?
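Outside of Spark, the "skip the first n rows" behaviour being asked about amounts to a stream skip over the file's lines. A minimal plain-Java sketch of that logic, using the sample file from the question (the class and method names here are illustrative, not part of any Spark API):

```java
import java.util.List;
import java.util.stream.Collectors;

public class SkipLinesDemo {
    // Return the lines with the first n dropped -- the effect the question
    // is asking the CSV reader to provide for the leading metadata rows.
    static List<String> skipFirst(List<String> lines, int n) {
        return lines.stream().skip(n).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> file = List.of(
            "m,Version,v1.0",
            "m,Type,xx",
            "h,Col1,Col2",
            "m,Mandatory,Optional",
            "d,Val1,Val2");
        // Drop the two leading "m," metadata rows.
        System.out.println(skipFirst(file, 2));
        // prints [h,Col1,Col2, m,Mandatory,Optional, d,Val1,Val2]
    }
}
```

Within Spark, one workaround along these lines is to read the file as plain text, number the rows (e.g. with zipWithIndex on the underlying RDD), and filter out the leading ones before parsing as CSV.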

Answer

You can try setting the "comment" option to "m", effectively telling the CSV reader to skip lines beginning with the "m" character.

Dataset<Row> df = spark.read()
          .format("csv")
          .option("header", "true")
          .option("inferSchema", "true")
          .option("comment", "m")
          .load("/home/user/data/20170326.csv");
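One caveat, going by the sample file above: the optionality row also begins with "m", so it is dropped as well, but the "h" and "d" markers survive as the first parsed field of the header and data rows. A plain-Java sketch of the filtering the comment option performs (class and method names are illustrative, not Spark API):

```java
import java.util.List;
import java.util.stream.Collectors;

public class CommentFilterDemo {
    // Simulate .option("comment", "m"): drop every line whose first
    // character is the comment marker, before any CSV parsing happens.
    static List<String> dropCommented(List<String> lines, char marker) {
        return lines.stream()
                    .filter(l -> l.isEmpty() || l.charAt(0) != marker)
                    .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> file = List.of(
            "m,Version,v1.0",
            "m,Type,xx",
            "h,Col1,Col2",
            "m,Mandatory,Optional",
            "d,Val1,Val2");
        System.out.println(dropCommented(file, 'm'));
        // prints [h,Col1,Col2, d,Val1,Val2]
        // The "h"/"d" markers remain as the first CSV field of each row.
    }
}
```

After loading, the leftover marker column (named "h" when "header" is true) can be removed with something like df.drop("h").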
