PySpark: inconsistency in converting timestamp to integer in dataframe
Problem description
I have a dataframe with a rough structure like the following:
+-------------------------+-------------------------+--------+
| timestamp               | adj_timestamp           | values |
+-------------------------+-------------------------+--------+
| 2017-05-31 15:30:48.000 | 2017-05-31 11:30:00.000 |      0 |
| 2017-05-31 15:31:45.000 | 2017-05-31 11:30:00.000 |      0 |
| 2017-05-31 15:32:49.000 | 2017-05-31 11:30:00.000 |      0 |
+-------------------------+-------------------------+--------+
...
I am trying to apply a conversion function to the two time columns to turn them into their integer representation using the time package. My user-defined function and how it is applied to the dataframe above:
import time

def timeConverter(timestamp):
    time_tuple = time.strptime(timestamp, "%Y-%m-%d %H:%M:%S.000")
    timevalue = time.mktime(time_tuple)
    return timevalue
from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType

def convertDateColumn(Data):
    timeUDF = udf(timeConverter, FloatType())
    finalData = Data.withColumn('adj_timestamp', timeUDF('adj_timestamp'))
    return finalData
For example, the first entry in the adj_timestamp column becomes:
1496244608
Converting this back via datetime.fromtimestamp gives 2017-05-31 15:30:08, which is not the same value that I started with... Curious as to what is going on!
Since I have far more rows than the 3 shown, is it possible that the data is being processed asynchronously and therefore the resulting dataframe is not in the same order as it was fed in?
Recommended answer
For the udf, I'm not quite sure yet why it's not working. It is most likely a floating-point precision problem when converting the Python function to a UDF: Spark's FloatType is a single-precision 32-bit float, which carries only about 7 significant decimal digits, so a 10-digit Unix timestamp gets rounded when stored. See below how using integer output works. Alternatively, you can solve this with a Spark function called unix_timestamp, which can parse the timestamp for you. I give an example below. Hope it helps a bit.
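To see the single-precision rounding concretely, you can round-trip an epoch value through a 32-bit float in plain Python (a small sketch, independent of Spark):

```python
import struct

def to_float32(x):
    # Round-trip through an IEEE-754 single-precision float,
    # which is the storage format behind Spark's FloatType.
    return struct.unpack('f', struct.pack('f', x))[0]

exact = 1496259048            # a 10-digit Unix timestamp
rounded = int(to_float32(exact))
print(exact, rounded)         # the two values differ
```

A 32-bit float has a 24-bit significand, so at this magnitude adjacent representable values are 128 apart; using DoubleType or IntegerType avoids the rounding.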
Here I create a Spark dataframe from the examples that you show:
import pandas as pd

df = pd.DataFrame([
    ['2017-05-31 15:30:48.000', '2017-05-31 11:30:00.000', 0],
    ['2017-05-31 15:31:45.000', '2017-05-31 11:30:00.000', 0],
    ['2017-05-31 15:32:49.000', '2017-05-31 11:30:00.000', 0]],
    columns=['timestamp', 'adj_timestamp', 'values'])
df = spark.createDataFrame(df)
Solve using a Spark function

Apply fn.unix_timestamp to the column timestamp:
import pyspark.sql.functions as fn
from pyspark.sql.types import *
df.select(fn.unix_timestamp(fn.col('timestamp'), format='yyyy-MM-dd HH:mm:ss.000').alias('unix_timestamp')).show()
For the first column, the output looks like this:
+--------------+
|unix_timestamp|
+--------------+
| 1496259048|
| 1496259105|
| 1496259169|
+--------------+
You can put this back to a timestamp using the datetime library:
import datetime
datetime.datetime.fromtimestamp(1496259048) # output as datetime(2017, 5, 31, 15, 30, 48)
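Note that datetime.fromtimestamp with no tz argument interprets the value in the machine's local timezone (the 15:30:48 result above corresponds to a UTC-4 zone). Passing an explicit timezone makes the check reproducible on any machine; a small sketch:

```python
import datetime

ts = 1496259048
# An aware UTC datetime prints the same on every machine, unlike the
# naive local-time result of datetime.datetime.fromtimestamp(ts).
utc = datetime.datetime.fromtimestamp(ts, tz=datetime.timezone.utc)
print(utc)  # 2017-05-31 19:30:48+00:00, i.e. 15:30:48 at UTC-4
```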
Solve by converting to an integer instead of a float
import datetime
import time

def timeConverter(timestamp):
    time_tuple = datetime.datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S.000").timetuple()
    timevalue = int(time.mktime(time_tuple))  # convert to int here
    return timevalue

time_udf = fn.udf(timeConverter, IntegerType())  # output integer
df.select(time_udf(fn.col('timestamp')))
Here, we will get the same timestamps [1496259048, 1496259105, 1496259169] as using unix_timestamp.
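One caveat worth checking in your setup: both time.mktime and Spark's unix_timestamp interpret the string in the local/session timezone, so the integers above depend on where the code runs. A timezone-independent variant of the UDF body, treating the string as UTC via calendar.timegm (a sketch, not the answer's original code):

```python
import calendar
import datetime

def timeConverterUTC(timestamp):
    # timegm treats the struct_time as UTC, so the result does not
    # depend on the machine's timezone (unlike time.mktime).
    time_tuple = datetime.datetime.strptime(
        timestamp, "%Y-%m-%d %H:%M:%S.000").timetuple()
    return calendar.timegm(time_tuple)

print(timeConverterUTC('2017-05-31 15:30:48.000'))  # 1496244648
```

If you go this route, the Spark side should also parse in UTC (e.g. by setting spark.sql.session.timeZone to UTC) so the two methods keep agreeing.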