Error while reading very large files with spark csv package

Problem description

We are trying to read a 3 GB file with spark-csv and the univocity 1.5.0 parser. One of its columns contains multiple newline characters, and in some rows the record is split into multiple columns at those newlines. This happens with large files.

We are using Spark 1.6.1 and Scala 2.10.

This is the code I'm using to read the file:

sqlContext.read
    .format("com.databricks.spark.csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .option("mode", "FAILFAST")   // abort on the first malformed line
    .option("escape", "\"")
    .option("quote", "\"")
    .option("parserLib", "univocity")
    .load("abc.csv")

The read fails with: java.lang.exception: FAILFAST at 01/20/2015.

Sample file: "A AAAAAAAA","AA999","AA999","AA999","9999-99-99-99.99.99.999999","AAAAAA99","Aaaaa Aaaaaaaa

99/99/9999 - AAA Aaaaaaa Aa: aaaaaaaaa aa A aaaaa, aaaaaaaa aaa aaaaaaa aaaaaaaaaa

Aaa aaaaa aa AAA aaa aaaaaaaaaaa

99/99/9999 Aaaaa aaaaaa - aa aaaaaaaa aaaaaaaaa aaaaaaaa aaaaa aaa aaaaaa aa aaaaaaaaaa aaaaaa aa aaaaaaa aaaaaaaaa.

99/99/9999 Aaa'a aaaaaa a/ aaa aaaaaaa - AAA aaaaaaaaa aaa'a aaaaaaa

99/99/9999 AAA aaaaaa - aaaaaaa aaaaaaaaa

99/99/9999 AAA aaaaaa. Aaa aaaa Aa. Aaaaaa Aa: aaaaaaaaa aaaaaaaa aaaaaa, A aaaaaaa aaaa aaaaaaaaaa, aaaaa aaaaaaa aaaa aaaaaaaaaa (aaaa aaaaaaaaaaaa aaaaaaa). A&Aa aaaaaa aa aaaaaaaaaa aaa aaaa aaaaaa aaaa aaaaa aa aaaaaaaaa, A aaaaaaaa aaaaa aaa aaaaa aaaaaaaa aaaaa aaaa aaaaa aa aaaaaaaaa. Aaa aaaaaa aaaaaa aaaaaa aaaa aaaaaa.

99/99/9999 - aaaaa aaaaaaaa.

99/99/9999 - AAA

99/99/9999 AAA aaaaaa aaaaa aa Aaa 9999 aaaa aaaaaaaaa aaaaaaaaaa - aa A&Aa. Aaaaaaaaaa aaaaa aaaaaa.

99/99/9999 AAA aaaaa aaaaaa - aa aaaaaaa aa aaaaa aaaaaa aa AAA aa AAA aaa aa aaaaaa aaaaaa aaaa-aaaaaaaaaaa. Aa aaaaaaaa aa aaaaaa A&Aa aaaaa aa aaaaa aaaaaaa.

99/99/9999 - Aaaaaa aaaaaa aaaa. Aaaaaaaa aaaa aaaa 99/99/9999 - 99/99/9999

99/99/9999 - aaaaaa aaaaaaa aa AAAA aa: AAAA aaaaa aaaa aaaaaa aaaa aaaa aaaaa aa aaa aaaaaaaaa.

99/99/9999 Aaaaaa a/ aaa aaaaaaa. Aaaa aaaaaaaa aa aaaaaaaaaaaa aa AA.

99/99/9999 Aaaaaa aaaaaa aaaaaa aaaa.

99/99/9999 Aaaaaaaa aaaaaa aa aaaaaa aaaa

99/99/9999 Aaaaaa a/ aaa aaaaaaa aaa'a aaaaaaaaa aaaaaaaaaaa aaaaaaa

99/99/9999 AAA aaaaaa A&Aa aaaaaa aaa aaaaaaaaaaaaaa aaa aaaaa aaaaaa

99/99/9999 AAA aaaaa aaaaaa - aaaaa aaaaaaaaaaaaaaa aaa aaaaaaaaaaaa aa aaaaaaaaaaa. Aaa aaaaaa aaaaaaaaa aaaaaaaa aaaaaaaaa aaaaaaaa aaa aaaa aaaaaa aa aaaaaa aaaaaa aaaaa aaaa aa aaaaaa aaa aaaaaaaa aaaaaaaaa A&Aa aaa aaaaaaaaa, aaaaaaaaa aaaaa aaaaaaaaa.

99/99/9999 AAA aaaaaa aaaaaaa aaaa aaaaaa aa Aaa 9. A&Aa aaaaaa aa aaaaa aaaaa aaaa aaaaaaaa, aaaaaaaaaa aaaa aaaaaaaa aaa aaaa aaaaa aaaaaaaa aaaaaa.

99/99/9999 AAA - aaaaaaaaaaa aaaaaaaaaa.

AAA aaaaaaaaa aaaaaaaaaa aaaaaaa aaaa aaaaaaaaaaaa aaaaa aa aaaa aaaaaa aa aaaaaaa aa aaaaa aaaaaaaaa aaaaa aa aaaaaaaaaaa aa aaaa.

99/99/9999 AAA aaaaa aaaaaa - Aaaaaaaaaaaa aaaaaa aa 99/99/9999 aaaaaa aaaa aaaa aaaaa aaa aaaaaaaaaa a/ aaaaaaaaa aaaaaaaaa aaaaaaaa. Aaa aaaaaaaaaaaa aaaa aa 99/99/9999 aaa aaa aaaaaaaaaaa aaaaaaaaaaaaa aaaaa 99/99/9999 aaaa aaa aaaaaaa aa aaaaaaaaa aaaaaaaa, aaaaaa AAA aa aaaaa aaaaaaaaa aa aa 99. Aa aaa aaaaaaa aa aaaaaaaaa aaaaaaaa, aaa aaaaaaaaaa aaaaaaaa aaaaa aaaa aaaaaaaaaaa aaaa aaaa aaaaaaaaaa aaaaaaaaaa aaaaaaaaaa.

99/99/9999 AAA aaaa aaaaa - AAA aaaaaaa aaaa A&Aa aaaaaaaaaa aa aaa aaaaaaaaaaaa aaaaa aaaa aaaa aaaaaaa aa Aaaaa 9999.

Aaaaaaaaa aaaaaa aa aaaaa aa aa Aaa 9, 9999 aaa aaaaaaa aaaaaaaa aaaaa aaa aaaaaaaa aaaa Aa. Aaaaaaaa aaa aaaa aaaaaa aa aaaaaaa aaaaaa aa A&Aa aaa aaaaaaaa aaaaaa aaaa aaaa. Aaaa aa aaaaaaa aaa aaaaa aa aaaaaaaaaa aaaa aaa aaaaaaaaaa aa aaaaa aa aaaaaaaaaa aaaaa aa aaaaaaaaaaaa.

99/99/9999 Aaaaaaa aaa'a aaaa AA

99/99/9999 - a/a aaaa aa aaaaaaaaaaaa

99/99/9999 Aaaaaaa aaa'a aaaa aaaaaaaaaaaa

99/99/9999 - aaaa aaaaaa aa aaaaaaaaaaaa aaaaaaaa aaa aaa aaaaaaaaaa 99/99/9999 - aaa aaaa aa aaaaaaaaaaaa aaaaaa aaa aaaaaaaaaaaa aaaa aaaa aaaaa aaaa aaaa aaa Aaa 99, 9999 aaaaa aaa aaa aaaaaaaaaa

99/99/9999 - aaaa aaa'a aaaa aaaaaaaaaaaa aaaaaaaa aa aaaa aaaa aaaaaaa aaaaaaaaaaa 99/99/9999 - aaaa aaaaaa aa aaaaaaaaaaaa aa: a/a aaaa aa aa aaaa. Aaaaaaaaa aaaaaaa aaa aaaaaa aaaa aaa aaaaaaaaaaa aaa aaa aaaaaaa aaa aa aaa aaaaaa aa aaaaa. aaa aaaa aaa aaaa aaaaa aaaaa aaaaaaaa aaa aaaa Aaaa aaa aaaa aa Aaaaaaaaa. Aaaa aaa aa aaaaa a/a aaaaa aaaaa. Aaa aaaaaa aa aaaa aaaaa aaaaa.

99/99/9999 - Aaaaa AAA aaaaaa aaaaaaaa. Aaaaaaaaa aaaa aaaa aaaaa aaaaaaa aaaaaaaa aaa Aaaaa Aaaaaaaaaa Aaaaaaaa, aaaaaaa, aaaaa aa a aaa aa aaaa aaaa aaaaaaa aa aaaaaaaa aa aaaaaaa, aaaa aaaaa, aaa aaaaaa, aaaa aa aaaaaaaa, aaaa aa aaaaaaaaaa, aaaaaaa aaaaa aaaaaa. Aaaaa aaa aaaaa aaaa aaaaaaa aaaaaaaa aa aaa aaaaaaaaaa aaaaaaaaaaa aa aaaaaaaa aaaaaaaaa aaaaaaa (aaaaa aa aaaaaaaaaa aa Aaaa 9999). Aaaa aa aaaaa aa aaaa aa aaaaaa aa aaaa. Aaa aaaaaaaa aaa aaaaaaaaaa aa a aaaaaaaaaa aa aaaaaaaa aaaaaaaa, aaaaaa aaaaa aa aaa aaaaaa aaaaaaaaaaa aaaaa aaa aaaaaaaa aa aaa aaaaaaaa aaaaa aa Aaa 9999 aa aaa aaaaaaa aa aaaaaaa aa aaaaaaa aaaaaaaa. Aa Aaa 9999, Aa. Aaaaaaaa aaaaa aaaaaaaaaa aaa aaaaaaaa aaaaaaaa, aaa aa aaaa aaa aaaaaaa aa aaaa aaa aa aaa aaaaaaaa. Aa aa aaaaa aa Aa. A aaaa aaaaaaaaaa aaaaaaaa aaaaaaaaa aaaa aaaa. Aaa A/Aa aaa aaaaa aaaaa Aaa 9999 aaaaa aaaa aaaaaaaa aaaa aa aaaaaaaaaa, aaaa aa aaaaaaaaaaaaa aaa aaaaaaaaa, aaaaaaa, aaaaaaaaa, aaaaaaaaa aaaa, aaaaaaaaaaaaa. Aaaaaaaaa: Aaaaa aaa aaaaaaaa aa aaaaaaa aa aaaa aaaaa, aaaaaaa aaaa aaa aa-aaaaaa aaaaaaa aaaaaaaaa aaa aaaa aa aaaa aaaaaaaa aa aaaaa aaaaaaaaa aaaaaaa aa aaaa-aaaaaaaaaa aaaaaaaaaa, aaa aaaaaaaaa aaaaaaa aaaa. "

Solution

Spark's CSV relation is based on its TextBasedFileFormat and reads the input strictly line by line, so it does not support multi-line records. If you need multi-line records, you can use wholeTextFiles instead and parse the input manually (ideally, though, this should be done as a pre-processing data-cleanup job).
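
To illustrate the "parse the input manually" step, here is a minimal sketch in plain Scala, not the spark-csv or univocity implementation. The object `MultiLineCsv` and its method `parseRecords` are hypothetical names introduced for this example. It assumes fields are quoted with `"` and that a doubled quote (`""`) inside a quoted field stands for one literal quote, matching the `quote` and `escape` options in the question.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical helper for illustration; not part of Spark or spark-csv.
object MultiLineCsv {

  /** Splits CSV text into records, treating newlines inside double-quoted
    * fields as data rather than as record separators.
    * Assumption: "" inside a quoted field means one literal quote.
    * This is a sketch, not a complete CSV parser.
    */
  def parseRecords(text: String, sep: Char = ','): Seq[Seq[String]] = {
    val records = ArrayBuffer.empty[Seq[String]]
    var fields = ArrayBuffer.empty[String]
    val field = new StringBuilder
    var inQuotes = false
    var i = 0

    def endField(): Unit = { fields += field.toString; field.clear() }
    def endRecord(): Unit = {
      endField()
      records += fields.toList
      fields = ArrayBuffer.empty[String]
    }

    while (i < text.length) {
      val c = text.charAt(i)
      if (inQuotes) {
        if (c == '"') {
          // A doubled quote is an escaped quote; a single one closes the field.
          if (i + 1 < text.length && text.charAt(i + 1) == '"') { field += '"'; i += 1 }
          else inQuotes = false
        } else field += c // newlines inside quotes are kept as field content
      } else c match {
        case '"'   => inQuotes = true
        case `sep` => endField()
        case '\r'  => // dropped outside quotes; '\n' ends the record
        case '\n'  => endRecord()
        case other => field += other
      }
      i += 1
    }
    // Flush a final record that is not terminated by a newline.
    if (field.nonEmpty || fields.nonEmpty) endRecord()
    records.toList
  }
}
```

With Spark, this could be applied as `sc.wholeTextFiles("abc.csv").flatMap { case (_, content) => MultiLineCsv.parseRecords(content) }`. Note that wholeTextFiles loads each file whole as a single string, which may not be practical for a 3 GB file unless it is first split into smaller files at record boundaries. That is one reason a separate clean-up pass, for example one that replaces newlines inside quoted fields before loading, is the preferable route.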
