How do you find anomalies in format of a column in a spark dataframe?
Problem description
正如问题所说,我想在大型数据集中的列中查找值格式的异常.
例如:如果我在一个包含 5 亿行的数据集中有一个日期列,我想确保该列中所有行的日期格式都是 MM-DD-YYYY.我想找到这种格式中存在异常的计数和值.
我该怎么做呢?我可以使用正则表达式吗?有人可以举个例子吗?想使用 Spark Dataframe 做到这一点.
As the question says, I want to find anomalies in the format of the value in a column in a large dataset.
For example: if I have a date column within a dataset of say 500 million rows, I want to make sure that the date format for all rows in the column is MM-DD-YYYY. I want to find the count and the values where there is an anomaly in this format.
How do I do this? Can I use regex? Can someone give an example? I want to do this using a Spark DataFrame.
Recommended answer
Proper date format validation using regex can be tricky (See: Regex to validate date format dd/mm/yyyy), but you can use Joda-Time as below:
import scala.util.{Try, Failure}
import org.apache.spark.sql.functions.udf

// Serializable so the parser can be shipped to executors inside the UDF
object FormatChecker extends java.io.Serializable {
  val fmt = org.joda.time.format.DateTimeFormat forPattern "MM-dd-yyyy"
  def invalidFormat(s: String) = Try(fmt parseDateTime s) match {
    case Failure(_) => true
    case _ => false
  }
}

val df = sc.parallelize(Seq(
  "01-02-2015", "99-03-2010", "---", "2015-01-01", "03-30-2001"
)).toDF("date")

val invalidFormat = udf((s: String) => FormatChecker.invalidFormat(s))
df.where(invalidFormat($"date")).count()
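To see why regex alone is tricky, note that a shape-only pattern accepts impossible dates, which is exactly what a real parser catches. A minimal demonstration (the `looksLikeDate` name is illustrative, not from the answer):

```scala
// A regex checks only the shape DD-DD-DDDD, not calendar validity.
def looksLikeDate(s: String): Boolean =
  s.matches("""^\d{2}-\d{2}-\d{4}$""")

looksLikeDate("01-02-2015")  // true: valid date, valid shape
looksLikeDate("99-03-2010")  // true: month 99 is impossible, but the shape matches
looksLikeDate("2015-01-01")  // false: wrong shape
```

So a regex can pre-filter obvious garbage like `---`, but `99-03-2010` slips through; parser-based validation is what actually flags it.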
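On Java 8+ / Spark 2.x+ you can get the same strict validation without the Joda-Time dependency by using `java.time`. A sketch under that assumption (`StrictChecker` is an illustrative name; `uuuu` is used instead of `yyyy` because strict resolution of year-of-era requires an era field):

```scala
import java.time.LocalDate
import java.time.format.{DateTimeFormatter, ResolverStyle}
import scala.util.Try

object StrictChecker extends Serializable {
  // STRICT resolution rejects impossible dates such as 99-03-2010
  private val fmt = DateTimeFormatter
    .ofPattern("MM-dd-uuuu")
    .withResolverStyle(ResolverStyle.STRICT)

  def isInvalid(s: String): Boolean =
    Try(LocalDate.parse(s, fmt)).isFailure
}
```

Wrapped as a UDF the same way as the Joda-Time version (`udf(StrictChecker.isInvalid _)`), it flags the same three bad rows in the sample data.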