How to test datatype conversion during casting
Question
We have a script that maps data into a dataframe (we're using pyspark). The data comes in as a string, and some other, sometimes expensive, stuff is done to it, but as part of the operation (calling withColumn) we do a cast to its final data type.
I have a requirement to tell if truncation occurred, but we don't want to fail if it does. We just want a number to know how many rows in each translated column (there are about 300 columns) failed.
My first thought was to have each column pass through a UDF which would do the test; the output would be an array with the value, plus a flag for whether it passed the datatype checks. I'd then do two selections: one selects the raw values from the array, and the other aggregates the misses. But this seems like a sloppy solution. I'm fairly new to the pyspark/hadoop world... would love to know if there's a better (maybe standard?) way to do this.
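For reference, the UDF idea described above could be sketched like this. The helper below is hypothetical (the column name `value` and the int32 range check are assumptions, not from the original post), and the Spark wiring is shown in comments:

```python
# Hypothetical helper behind the proposed UDF: parse a string and flag
# parse failures and int32 overflow instead of failing the job.
def checked_int(s):
    """Return (value, ok): the parsed int32 value plus whether the cast succeeded."""
    try:
        v = int(s)
    except (TypeError, ValueError):
        return (None, False)
    if -2**31 <= v <= 2**31 - 1:
        return (v, True)
    return (None, False)

# Wired into Spark (sketch):
# from pyspark.sql import functions as F
# from pyspark.sql.types import StructType, StructField, IntegerType, BooleanType
# schema = StructType([StructField('value', IntegerType()),
#                      StructField('ok', BooleanType())])
# checked_udf = F.udf(checked_int, schema)
# df2 = df.withColumn('checked', checked_udf('value'))
# df2.select(F.col('checked.value'))                  # selection 1: the raw values
# df2.agg(F.count(F.when(~F.col('checked.ok'), 1)))  # selection 2: count of misses
```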
Answer
In the latest Spark versions, casting numbers doesn't fail and doesn't result in silent overflows: if the value is not properly formatted, or is too large to be accommodated by the target type, the result is NULL.
So all you have to do is a simple count of NULL values (see Count number of non-NaN entries in each column of Spark dataframe with Pyspark) after the cast:
from pyspark.sql.functions import count

df = spark.createDataFrame(['132312312312312321312312', '123', '32'], 'string')
df_cast = df.withColumn('value_casted', df['value'].cast('integer'))

df_cast.select(
    # count('value') - number of NOT NULL values before the cast
    # count('value_casted') - number of NOT NULL values after the cast
    (count('value') - count('value_casted')).alias('value_failed')
).show()
# +------------+
# |value_failed|
# +------------+
# | 1|
# +------------+
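Since the question mentions roughly 300 translated columns, the same count can be built for every casted column in a single aggregation pass. A minimal sketch, assuming the hypothetical convention that each casted column is named `<original>_casted` (the original post doesn't state its naming scheme):

```python
# Assumed (hypothetical) naming scheme: each casted column is '<orig>_casted'.
CASTED_SUFFIX = '_casted'

def failed_pairs(columns):
    """Map each casted column name back to its source column name."""
    return [(c[:-len(CASTED_SUFFIX)], c)
            for c in columns if c.endswith(CASTED_SUFFIX)]

# With pyspark, one select covers all columns in a single pass over the data:
# from pyspark.sql.functions import count
# exprs = [(count(orig) - count(casted)).alias(orig + '_failed')
#          for orig, casted in failed_pairs(df_cast.columns)]
# df_cast.select(*exprs).show()
```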