Spark map/Filter throws java.io.IOException: Too many bytes before newline: 2147483648
Problem description
I have a simple 7 GB file in which each line contains two columns separated by |. I created an RDD from this file, but when I apply a map or filter transformation on this RDD I get a "too many bytes" exception.
Below is sample data from my file:
116010100000000007|33448
116010100000000014|13520
116010100000000021|97132
116010100000000049|82891
116010100000000049|82890
116010100000000056|93014
116010100000000063|43434
116010100000000063|43434
Here is the code:
val input = sparkContext.textFile("hdfsfilePath")
input.filter(x => x.split("|")(1).toInt > 15000).saveAsTextFile("hdfs://output file path")
Below is the exception I am getting:
java.io.IOException: Too many bytes before newline: 2147483648
at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:249)
at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
at org.apache.hadoop.mapreduce.lib.input.UncompressedSplitLineReader.readLine(UncompressedSplitLineReader.java:94)
at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:136)
The issue was with my Scala code when splitting each line on the pipe delimiter. I changed the code and now it is working. Below is the changed code:
val input = sparkContext.textFile("hdfsfilePath")
input.filter(x => x.split('|')(1).toInt > 15000).saveAsTextFile("hdfs://output file path")
Instead of "|" I needed to use either '|' or "\\|" in the split method. String.split takes a regular expression, and | is the regex alternation operator, so split("|") matches the empty string at every position and breaks the line into single characters instead of two columns.
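A minimal sketch of the difference, runnable in the Scala REPL (using a sample line from the data above):

// split(String) treats its argument as a regex; "|" matches the empty
// string at every position, so every character becomes its own element.
"116010100000000007|33448".split("|")(1)   // "1"
// split(Char) splits on the literal character.
"116010100000000007|33448".split('|')(1)   // "33448"
// Escaping the pipe makes the regex match the literal character too.
"116010100000000007|33448".split("\\|")(1) // "33448"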