Spark read CSV - Not showing corrupt records
Problem Description
Spark has a PERMISSIVE mode for reading CSV files, which stores corrupt records in a separate column named _corrupt_record.
permissive - Sets all fields to null when it encounters a corrupted record and places all corrupted records in a string column called _corrupt_record
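For contrast, PERMISSIVE is only one of the available parse modes. A minimal sketch (assuming the same data.csv and an active spark session) of DROPMALFORMED, which silently discards rows that fail to parse rather than nulling them out:

```scala
import org.apache.spark.sql.types.{StructField, StructType, DecimalType}

val schema = new StructType(Array(
  new StructField("value", DecimalType(25, 10), true)
))

// DROPMALFORMED discards malformed rows entirely instead of
// setting their fields to null as PERMISSIVE does
val dropped = spark.read.format("csv")
  .option("header", "true")
  .option("mode", "DROPMALFORMED")
  .schema(schema)
  .load("../test.csv")

dropped.show()  // only the rows that parse as decimals remain
```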
However, when I try the following example, I don't see any column named _corrupt_record; the records which don't match the schema appear as null.
data.csv
data
10.00
11.00
$12.00
$13
gaurang
Code
import org.apache.spark.sql.types.{StructField, StructType, StringType, LongType, DecimalType}

val schema = new StructType(Array(
  new StructField("value", DecimalType(25, 10), false)
))

val df = spark.read.format("csv")
  .option("header", "true")
  .option("mode", "PERMISSIVE")
  .schema(schema)
  .load("../test.csv")
Schema
scala> df.printSchema()
root
|-- value: decimal(25,10) (nullable = true)
scala> df.show()
+-------------+
| value|
+-------------+
|10.0000000000|
|11.0000000000|
| null|
| null|
| null|
+-------------+
If I change the mode to FAILFAST, I get an error when I try to view the data.
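That error appears only when an action materializes the data, because reads are lazy. A sketch of the FAILFAST variant, assuming the same schema and file path as above:

```scala
// FAILFAST aborts the whole read as soon as one malformed row is seen.
// Nothing happens at load() time; the failure surfaces at show()/count().
val strict = spark.read.format("csv")
  .option("header", "true")
  .option("mode", "FAILFAST")
  .schema(schema)
  .load("../test.csv")

strict.show()  // expected to throw a SparkException about malformed records
```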
Recommended Answer
Adding the _corrupt_record column to the schema, as suggested by Andrew and Prateek, resolved the issue.
import org.apache.spark.sql.types.{StructField, StructType, StringType, LongType, DecimalType}

val schema = new StructType(Array(
  new StructField("value", DecimalType(25, 10), true),
  new StructField("_corrupt_record", StringType, true)
))

val df = spark.read.format("csv")
  .option("header", "true")
  .option("mode", "PERMISSIVE")
  .schema(schema)
  .load("../test.csv")
Querying the data
scala> df.show()
+-------------+---------------+
| value|_corrupt_record|
+-------------+---------------+
|10.0000000000| null|
|11.0000000000| null|
| null| $12.00|
| null| $13|
| null| gaurang|
+-------------+---------------+
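With the corrupt records captured, they can be filtered out for inspection; the column name itself is configurable through the columnNameOfCorruptRecord option. A sketch, assuming the df and schema from the answer above:

```scala
// Since Spark 2.3, queries that reference only the internal corrupt-record
// column of a raw CSV/JSON read are disallowed; caching first avoids that.
df.cache()

// Isolate only the rows that failed to parse
val bad = df.filter(df("_corrupt_record").isNotNull)
bad.show()

// The corrupt-record column name is configurable; it just has to
// match the field name declared in the schema
val df2 = spark.read.format("csv")
  .option("header", "true")
  .option("mode", "PERMISSIVE")
  .option("columnNameOfCorruptRecord", "_corrupt_record")
  .schema(schema)
  .load("../test.csv")
```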