Convert scientific notation in string format to numeric in a Spark DataFrame

Question

Day_Date,timeofday_desc,Timeofday_hour,Timeofday_minute,Timeofday_second,value
2017-12-18,12:21:02 AM,0,21,2,"1.779209040E+08"
2017-12-19,12:21:02 AM,0,21,2,"1.779209040E+08"
2017-12-20,12:30:52 AM,0,30,52,"1.779209040E+08"
2017-12-21,12:30:52 AM,0,30,52,"1.779209040E+08"
2017-12-22,12:47:10 AM,0,47,10,"1.779209040E+08"
2017-12-23,12:47:10 AM,0,47,10,"1.779209040E+08"
2017-12-24,02:46:59 AM,2,46,59,"1.779209040E+08"
2017-12-25,02:46:59 AM,2,46,59,"1.779209040E+08"
2017-12-26,03:10:27 AM,3,10,27,"1.779209040E+08"
2017-12-27,03:10:27 AM,3,10,27,"1.779209040E+08"
2017-12-28,03:52:08 AM,3,52,8,"1.779209040E+08"

I am trying to convert the value column to 177920904.

val df1 = df.withColumn("s", 'value.cast("Decimal(10,4)")).drop("value").withColumnRenamed("s", "value")

I also tried casting value to Float and Double. I always get null as output.

df1.select("value").show()


+-----------+
|   value   |
+-----------+
|       null|
|       null|
|       null|
|       null|
|       null|
|       null|
|       null|
|       null|

df.printSchema

root
 |-- Day_Date: string (nullable = true)
 |-- timeofday_desc: string (nullable = true)
 |-- Timeofday_hour: string (nullable = true)
 |-- Timeofday_minute: string (nullable = true)
 |-- Timeofday_second: string (nullable = true)
 |-- value: string (nullable = true)

Recommended answer

You just need to cast it to a decimal with enough room to fit the number.

Decimal is Decimal(precision, scale), so Decimal(10, 4) means 10 digits in total: 6 to the left of the decimal point and 4 to the right. The integer part of 177920904 has 9 digits, so the number does not fit in your Decimal type.

From the documentation:

precision represents the total number of digits that can be represented.

scale represents the number of fractional digits. This value must be less than or equal to precision. A scale of 0 produces integral values, with no fractional part.
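
To make the precision and scale arithmetic concrete, here is a minimal sketch in plain Scala (no Spark needed) that counts the integer digits of the value and checks them against a Decimal(precision, scale). The names DecimalFit, integerDigits and fits are hypothetical helpers for illustration, not part of Spark.

```scala
import java.math.BigDecimal

object DecimalFit {
  // Number of digits to the left of the decimal point in the parsed value.
  // BigDecimal understands scientific notation such as "1.779209040E+08".
  def integerDigits(s: String): Int = {
    val d = new BigDecimal(s)
    d.precision() - d.scale()
  }

  // True when the integer part fits in Decimal(precision, scale),
  // which leaves (precision - scale) digits for the integer part.
  def fits(s: String, precision: Int, scale: Int): Boolean =
    integerDigits(s) <= precision - scale

  def main(args: Array[String]): Unit = {
    val v = "1.779209040E+08"
    println(integerDigits(v)) // 9
    println(fits(v, 10, 4))   // false: only 6 integer digits available
    println(fits(v, 10, 0))   // true
    println(fits(v, 14, 4))   // true: 10 integer digits available
  }
}
```

This only checks the integer part; extra fractional digits are not treated as an overflow here, on the assumption that they would be rounded rather than rejected.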

Since you don't want any digits to the right of the decimal point, you can try this:

df.withColumn("s", 'value.cast("Decimal(10,0)"))

If you want to keep 4 decimal digits, you can change it to:

df.withColumn("s", 'value.cast("Decimal(14,4)"))

Input

df.show
+---------------+
|          value|
+---------------+
|1.779209040E+08|
+---------------+

Output

scala> df.withColumn("s", 'value.cast("Decimal(10,0)")).show
+---------------+---------+
|          value|        s|
+---------------+---------+
|1.779209040E+08|177920904|
+---------------+---------+

Full solution

No drop or rename is needed:

val df1 = df.withColumn("value", 'value.cast("Decimal(10,0)"))

Fixing the input data

As I said in the comment, the problem is that your numbers have some weird characters around them. You should remove them before casting.

Original

scala> df.show
+----------+--------------+--------------+----------------+----------------+-----------------+
|  Day_Date|timeofday_desc|Timeofday_hour|Timeofday_minute|Timeofday_second|            value|
+----------+--------------+--------------+----------------+----------------+-----------------+
|2017-12-18|   12:21:02 AM|             0|              21|               2| ?1.779209040E+08|
|2017-12-19|   12:21:02 AM|             0|              21|               2|?1.779209040E+08?|
|2017-12-20|   12:30:52 AM|             0|              30|              52| ?1.779209040E+08|
|2017-12-21|   12:30:52 AM|             0|              30|              52| ?1.779209040E+08|
|2017-12-22|   12:47:10 AM|             0|              47|              10| ?1.779209040E+08|
|2017-12-23|   12:47:10 AM|             0|              47|              10| ?1.779209040E+08|
|2017-12-24|   02:46:59 AM|             2|              46|              59| ?1.779209040E+08|
|2017-12-25|   02:46:59 AM|             2|              46|              59| ?1.779209040E+08|
|2017-12-26|   03:10:27 AM|             3|              10|              27| ?1.779209040E+08|
|2017-12-27|   03:10:27 AM|             3|              10|              27| ?1.779209040E+08|
|2017-12-28|   03:52:08 AM|             3|              52|               8| ?1.779209040E+08|
+----------+--------------+--------------+----------------+----------------+-----------------+

There are many ways to remove them. A quick one is a UDF with a regular expression that removes everything except digits, letters, the dot, + and -:

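import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.types.DecimalType
// Keep only letters, digits, '+', '.' and '-'; drop everything else
def clean(input: String) = input.replaceAll("[^a-zA-Z0-9\\+\\.-]", "")
val cleanUDF = udf(clean _)
df.withColumn("value", cleanUDF($"value").cast(DecimalType(10,0))).show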
+----------+--------------+--------------+----------------+----------------+---------+
|  Day_Date|timeofday_desc|Timeofday_hour|Timeofday_minute|Timeofday_second|    value|
+----------+--------------+--------------+----------------+----------------+---------+
|2017-12-18|   12:21:02 AM|             0|              21|               2|177920904|
|2017-12-19|   12:21:02 AM|             0|              21|               2|177920904|
|2017-12-20|   12:30:52 AM|             0|              30|              52|177920904|
|2017-12-21|   12:30:52 AM|             0|              30|              52|177920904|
|2017-12-22|   12:47:10 AM|             0|              47|              10|177920904|
|2017-12-23|   12:47:10 AM|             0|              47|              10|177920904|
|2017-12-24|   02:46:59 AM|             2|              46|              59|177920904|
|2017-12-25|   02:46:59 AM|             2|              46|              59|177920904|
|2017-12-26|   03:10:27 AM|             3|              10|              27|177920904|
|2017-12-27|   03:10:27 AM|             3|              10|              27|177920904|
|2017-12-28|   03:52:08 AM|             3|              52|               8|177920904|
+----------+--------------+--------------+----------------+----------------+---------+
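
The cleaning step can also be tried outside Spark. The sketch below reuses the same regular expression on a hypothetical dirty value; the byte-order mark and non-breaking space are assumptions chosen for illustration, since the question does not say which characters were actually present. CleanValue is a hypothetical name.

```scala
import java.math.{BigDecimal, RoundingMode}

object CleanValue {
  // Same pattern as the clean function above: drop everything except
  // letters, digits, '+', '.' and '-'.
  def clean(input: String): String =
    input.replaceAll("[^a-zA-Z0-9\\+\\.-]", "")

  def main(args: Array[String]): Unit = {
    // Assumed stray characters: a byte-order mark in front,
    // a non-breaking space behind.
    val dirty = "\uFEFF1.779209040E+08\u00A0"
    val parsed = new BigDecimal(clean(dirty))
    // Scale 0 drops the fractional digit, which is zero for this value.
    println(parsed.setScale(0, RoundingMode.HALF_UP).toPlainString) // 177920904
  }
}
```

Inside Spark, the built-in regexp_replace function from org.apache.spark.sql.functions can apply the same pattern to a column without a UDF. Note that clean as written would throw on a null input, so null values may need separate handling.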
