Explode nested (year, month, day, hour, minute, second) values into a single datetime-type field in a PySpark DataFrame
I'm trying to convert the nested fields into one field of DATETIME type. When I use the explode function I get an error: cannot resolve 'explode(START_Time)' due to data type mismatch
The data I have:
|-- MODEL: string (nullable = true)
|-- START_Time: struct (nullable = true)
| |-- day: string (nullable = true)
| |-- hour: string (nullable = true)
| |-- minute: string (nullable = true)
| |-- month: string (nullable = true)
| |-- second: string (nullable = true)
| |-- year: string (nullable = true)
|-- WEIGHT: string (nullable = true)
|-- REGISTRED: struct (nullable = true)
| |-- day: string (nullable = true)
| |-- hour: string (nullable = true)
| |-- minute: string (nullable = true)
| |-- month: string (nullable = true)
| |-- second: string (nullable = true)
| |-- year: string (nullable = true)
|-- TOTAL: string (nullable = true)
The result I'm looking for, with START_TIME and REGISTRED as DATE type:
+---------+------------------+----------+-----------------+---------+
|MODEL    | START_Time       | WEIGHT   |REGISTRED        |TOTAL    |
+---------+------------------+----------+-----------------+---------+
|.........| yy-mm-dd-hh-mm-ss| WEIGHT   |yy-mm-dd-hh-mm-ss|TOTAL    |
+---------+------------------+----------+-----------------+---------+
I have tried:
df.withColumn('START_Time', concat(col('START_Time.year'), lit('-'), .....))
but when there are empty values in the nested fields, the concatenation produces strings like (-----) and I get:
+---------+------------------+----------+-----------------+---------+
|MODEL    | START_Time       | WEIGHT   |REGISTRED        |TOTAL    |
+---------+------------------+----------+-----------------+---------+
|value    | -----            | value    | -----           |value    |
+---------+------------------+----------+-----------------+---------+
After concatenating, you can just cast the entire column to timestamp type; Spark will handle the missing (and invalid) data for you and return null instead.
from pyspark.sql import functions as F
(df
.withColumn('raw_string_date', F
.concat(
F.col('START_TIME.year'),
F.lit('-'),
F.col('START_TIME.month'),
F.lit('-'),
F.col('START_TIME.day'),
F.lit(' '),
F.col('START_TIME.hour'),
F.lit(':'),
F.col('START_TIME.minute'),
F.lit(':'),
F.col('START_TIME.second'),
)
)
.withColumn('date_type', F.col('raw_string_date').cast('timestamp'))
.show(10, False)
)
# +------------------------------------+---------------+-------------------+
# |START_TIME |raw_string_date|date_type |
# +------------------------------------+---------------+-------------------+
# |{1, 2, 3, 4, 5, 2021} |2021-4-1 2:3:5 |2021-04-01 02:03:05|
# |{, , , , , } |-- :: |null |
# |{null, null, null, null, null, null}|null |null |
# +------------------------------------+---------------+-------------------+