Spark read CSV - Not showing corrupt records


Problem description

Spark has a PERMISSIVE mode for reading CSV files, which stores corrupt records in a separate column named _corrupt_record.

permissive - Sets all fields to null when it encounters a corrupted record and places all corrupted records in a string column called _corrupt_record
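
For reference, the mode option accepts three values, of which PERMISSIVE is the default; a minimal sketch of setting it on the reader:

// The CSV reader's "mode" option accepts three values:
//   PERMISSIVE    - keep malformed rows, setting unparsable fields to null (default)
//   DROPMALFORMED - silently drop malformed rows
//   FAILFAST      - throw an exception on the first malformed row
val permissiveReader = spark.read
  .format("csv")
  .option("header", "true")
  .option("mode", "PERMISSIVE")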

However, when I try the following example, I don't see any column named _corrupt_record; the records that don't match the schema simply appear as null.

data.csv

data
10.00
11.00
$12.00
$13
gaurang

Code

import org.apache.spark.sql.types.{StructField, StructType, StringType, LongType, DecimalType}
val schema = new StructType(Array(
  new StructField("value", DecimalType(25,10), false)
))
val df = spark.read.format("csv")
  .option("header", "true")
  .option("mode", "PERMISSIVE")
  .schema(schema)
  .load("../test.csv")

Schema

scala> df.printSchema()
root
 |-- value: decimal(25,10) (nullable = true)

(Note that value is reported as nullable = true even though the schema declared it non-nullable: Spark's file-based sources force every field to nullable on read.)


scala> df.show()
+-------------+
|        value|
+-------------+
|10.0000000000|
|11.0000000000|
|         null|
|         null|
|         null|
+-------------+

If I change the mode to FAILFAST, I get an error when I try to view the data.
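
A minimal sketch of that behavior, reusing the schema defined above: with FAILFAST the load itself is lazy, and the exception only surfaces when an action such as show() touches a malformed row.

import scala.util.Try

// FAILFAST throws a SparkException as soon as an action hits a malformed row.
val failFastDf = spark.read.format("csv")
  .option("header", "true")
  .option("mode", "FAILFAST")
  .schema(schema)
  .load("../test.csv")

// show() triggers parsing; Try captures the failure raised for "$12.00".
println(Try(failFastDf.show()).isFailure) // prints: true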

Recommended answer

Adding _corrupt_record to the schema, as suggested by Andrew and Prateek, resolved the issue.

import org.apache.spark.sql.types.{StructField, StructType, StringType, LongType, DecimalType}
val schema = new StructType(Array(
  new StructField("value", DecimalType(25,10), true),
  // Declaring _corrupt_record in the schema lets PERMISSIVE mode populate it
  new StructField("_corrupt_record", StringType, true)
))
val df = spark.read.format("csv")
  .option("header", "true")
  .option("mode", "PERMISSIVE")
  .schema(schema)
  .load("../test.csv")

Querying the data

scala> df.show()
+-------------+---------------+
|        value|_corrupt_record|
+-------------+---------------+
|10.0000000000|           null|
|11.0000000000|           null|
|         null|         $12.00|
|         null|            $13|
|         null|        gaurang|
+-------------+---------------+
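
As a follow-up, a minimal sketch of separating the corrupt rows from the clean ones. The cache() call is a precaution: since Spark 2.3, queries over raw CSV/JSON files that reference only the internal corrupt-record column are disallowed unless the parsed result is cached first.

// Cache the parsed result before querying _corrupt_record (Spark 2.3+ restriction).
df.cache()

// Rows that failed to parse against the schema.
val badRows = df.filter(df("_corrupt_record").isNotNull)

// Rows that parsed cleanly, with the helper column dropped.
val goodRows = df.filter(df("_corrupt_record").isNull).drop("_corrupt_record")

badRows.show()
goodRows.show()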
