Why does reading a CSV file with empty values lead to an IndexOutOfBoundsException?
Question
I have a CSV file with the following structure:
Name | Val1 | Val2 | Val3 | Val4 | Val5
John 1 2
Joe 1 2
David 1 2 10 11
I am able to load this into an RDD fine. When I try to create a schema and then a DataFrame from it, I get an IndexOutOfBounds error.
The code is something like this:
val rowRDD = fileRDD.map(p => Row(p(0), p(1), p(2), p(3), p(4), p(5), p(6)))
When I try to perform an action on rowRDD, it gives the error.
Any help is greatly appreciated.
Answer
This is not an answer to your question, but it may help solve your problem.
From the question I see that you are trying to create a DataFrame from a CSV. This can easily be done using the spark-csv package.
With spark-csv, the Scala code below can be used to read a CSV:

val df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load(csvFilePath)
For your sample data I got the following result:
+-----+----+----+----+----+----+
| Name|Val1|Val2|Val3|Val4|Val5|
+-----+----+----+----+----+----+
| John| 1| 2| | | |
| Joe| 1| 2| | | |
|David| 1| 2| | 10| 11|
+-----+----+----+----+----+----+
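As for the original error: it is consistent with indexing past the end of the split array. A row like John 1 2 yields only three fields after splitting, so p(3) and beyond throw. One way to guard against ragged rows, if you stay with the manual approach, is to pad every row to the schema's arity before building the Row. A minimal sketch, assuming a comma-delimited file and a fixed six-column layout (both assumptions, since the question does not state the delimiter):

```scala
// Sketch of a guard for ragged CSV rows (assumptions: comma-delimited file,
// six columns; adjust both to match the real data).
object PadRows {
  val expectedCols = 6

  // split with limit -1 keeps trailing empty fields; padTo fills short rows
  // so Row(fields: _*) never indexes past the end of the array.
  def toFields(line: String): Seq[String] =
    line.split(",", -1).toSeq.padTo(expectedCols, "")

  def main(args: Array[String]): Unit = {
    println(toFields("John,1,2"))        // padded with three empty cells
    println(toFields("David,1,2,,10,11")) // already six fields, unchanged
  }
}
```

You would then map with something like fileRDD.map(line => Row(PadRows.toFields(line): _*)), so every Row matches the schema regardless of how many values the line carries.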
You can also use inferSchema with the latest version. See this answer.
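Schema inference samples column values and picks a type for each column. As a toy illustration of the idea only (this is not spark-csv's actual implementation), a column can be treated as integer-typed when every non-empty sampled value parses as an Int:

```scala
// Toy sketch of type inference (hypothetical, not the library's real logic):
// "IntegerType" if every non-empty value parses as an Int, else "StringType".
object InferSketch {
  def inferType(values: Seq[String]): String = {
    val nonEmpty = values.filter(_.nonEmpty)
    val allInts = nonEmpty.nonEmpty &&
      nonEmpty.forall(v => scala.util.Try(v.toInt).isSuccess)
    if (allInts) "IntegerType" else "StringType"
  }

  def main(args: Array[String]): Unit = {
    println(inferType(Seq("1", "2", "")))  // integers with an empty cell
    println(inferType(Seq("John", "Joe"))) // names stay strings
  }
}
```

Note how empty cells are ignored during inference, which is why columns like Val3 in the sample data can still come out numeric even though some rows leave them blank.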