How to test datatype conversion during casting


Problem Description

We have a script that maps data into a dataframe (we're using pyspark). The data comes in as a string, and some other, sometimes expensive, work is done to it, but as part of the operation (calling withColumn) we cast it to its final data type.

I have a requirement to tell whether truncation occurred, but we don't want the job to fail if it does. We just want, for each translated column (there are about 300), a count of how many rows failed.

My first thought was to pass each column through a UDF that would do the test; the output would be an array containing the value plus a flag for whether it passed the datatype check. I'd then do two selections: one selecting the raw values from the array, and one aggregating the misses. But this seems like a sloppy solution. I'm fairly new to the pyspark/hadoop world... would love to know if there's a better (maybe standard?) way to do this.
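
For concreteness, a rough sketch of that UDF idea, assuming a DataFrame df with a string column 'value'. The names here (cast_with_check, the passed flag, the IntegerType range check) are all hypothetical illustrations, and a struct is used instead of an array so the two outputs keep distinct types:

from pyspark.sql.functions import udf, col, sum as spark_sum
from pyspark.sql.types import StructType, StructField, IntegerType

# Hypothetical return type: the parsed value plus a pass/fail flag.
check_schema = StructType([
    StructField('value', IntegerType()),
    StructField('passed', IntegerType()),
])

@udf(returnType=check_schema)
def cast_with_check(s):
    # Parse the string, then verify it fits in a 32-bit integer.
    try:
        v = int(s)
    except (TypeError, ValueError):
        return (None, 0)
    if -2**31 <= v <= 2**31 - 1:
        return (v, 1)
    return (None, 0)

checked = df.withColumn('checked', cast_with_check(col('value')))
# Selection 1: the raw casted values.
values = checked.select(col('checked.value').alias('value_casted'))
# Selection 2: aggregate the misses.
misses = checked.select((1 - col('checked.passed')).alias('missed')).agg(spark_sum('missed'))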

Solution

In the latest Spark versions, casting numbers doesn't fail and doesn't result in silent overflows: if a value is not properly formatted, or is too large to be accommodated by the target type, the result is NULL.

So all you have to do is a simple count of NULL values after the cast (see Count number of non-NaN entries in each column of Spark dataframe with Pyspark):

from pyspark.sql.functions import count

df = spark.createDataFrame(['132312312312312321312312', '123', '32'], 'string')
df_cast = df.withColumn('value_casted', df['value'].cast('integer'))

df_cast.select((
    # count('value')         - count of NOT NULL values before
    # count('value_casted')  - count of NOT NULL values after
    count('value') - count('value_casted')).alias('value_failed')
).show()
# +------------+
# |value_failed|
# +------------+
# |           1|
# +------------+
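
The same trick scales to the many-column case from the question: compute all of the before/after differences in a single aggregation. A minimal sketch, assuming (hypothetically) that each source column has a casted companion named <original>_casted:

from pyspark.sql.functions import count

# Hypothetical list of source columns; in practice this would be the ~300 names.
source_cols = ['value', 'other_value']

# One pass over the data: for each column, failed casts =
# (non-NULL count before) - (non-NULL count after).
df_cast.select([
    (count(c) - count(c + '_casted')).alias(c + '_failed')
    for c in source_cols
]).show()

Because all the counts live in one select, Spark can compute them in a single aggregation instead of one job per column.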
