PySpark: inconsistency in converting timestamp to integer in dataframe


Problem Description

I have a dataframe with a rough structure like the following:

+-------------------------+-------------------------+--------+
| timestamp               | adj_timestamp           | values |
+-------------------------+-------------------------+--------+
| 2017-05-31 15:30:48.000 | 2017-05-31 11:30:00.000 | 0      |
+-------------------------+-------------------------+--------+
| 2017-05-31 15:31:45.000 | 2017-05-31 11:30:00.000 | 0      |
+-------------------------+-------------------------+--------+
| 2017-05-31 15:32:49.000 | 2017-05-31 11:30:00.000 | 0      |
...

I am trying to apply a conversion function to the two time columns to turn them into their integer representation using the time package. My user-defined function and how it is applied to the dataframe above:

import time
from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType

def timeConverter(timestamp):
    # parse the string and convert it to Unix epoch seconds
    time_tuple = time.strptime(timestamp, "%Y-%m-%d %H:%M:%S.000")
    timevalue = time.mktime(time_tuple)
    return timevalue

def convertDateColumn(Data):
    timeUDF = udf(timeConverter, FloatType())
    finalData = Data.withColumn('adj_timestamp', timeUDF('adj_timestamp'))

    return finalData

For example, the first entry in the adj_timestamp column becomes: 1496244608

Converting this back via datetime.fromtimestamp gives: 2017-05-31 15:30:08

Which is not the same value that I started with... Curious as to what is going on!

Since I have far more rows than the 3 shown, is it possible that the data is being processed asynchronously and therefore the resulting dataframe is not in the same order as it was fed in?

Answer

For the udf, the root cause is float precision when the result goes through Spark's FloatType: it is a single-precision 32-bit float, and its 24-bit mantissa cannot hold a 10-digit epoch value exactly, so the true 1496244600.0 gets rounded to the nearest representable float, 1496244608.0. See below how using integer output avoids this. Alternatively, you can solve it with the Spark function unix_timestamp, which converts a timestamp string for you. I give an example below. Hope it helps a bit.
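
You can see the rounding directly in plain Python; a minimal sketch, assuming numpy is installed (numpy.float32 has the same 24-bit mantissa as Spark's FloatType):

import numpy as np

true_epoch = 1496244600              # what mktime returns for '2017-05-31 11:30:00.000' in the asker's timezone
print(float(np.float32(true_epoch))) # 1496244608.0 -- off by 8 seconds

At this magnitude adjacent float32 values are 2**(30-23) = 128 apart, so every epoch second gets snapped to the nearest multiple of 128.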

Here I create a Spark dataframe from the example rows that you show:

import pandas as pd

# assumes an active SparkSession named `spark`, as in the pyspark shell
df = pd.DataFrame([
    ['2017-05-31 15:30:48.000', '2017-05-31 11:30:00.000', 0],
    ['2017-05-31 15:31:45.000', '2017-05-31 11:30:00.000', 0],
    ['2017-05-31 15:32:49.000', '2017-05-31 11:30:00.000', 0]],
    columns=['timestamp', 'adj_timestamp', 'values'])
df = spark.createDataFrame(df)

Solve using a Spark function

Apply fn.unix_timestamp to the column timestamp:

import pyspark.sql.functions as fn
from pyspark.sql.types import *
df.select(fn.unix_timestamp(fn.col('timestamp'), format='yyyy-MM-dd HH:mm:ss.000').alias('unix_timestamp')).show()

For the first column, the output looks like this:

+--------------+
|unix_timestamp|
+--------------+
|    1496259048|
|    1496259105|
|    1496259169|
+--------------+
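
If you want the result attached as a new column rather than a bare select, a minimal sketch (the column name unix_ts is my own choice, not from the original post):

df = df.withColumn('unix_ts', fn.unix_timestamp(fn.col('timestamp'), format='yyyy-MM-dd HH:mm:ss.000'))

unix_timestamp returns a LongType column, so there is no single-precision rounding here.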

You can turn this back into a timestamp using the datetime library:

import datetime
datetime.datetime.fromtimestamp(1496259048) # output as datetime(2017, 5, 31, 15, 30, 48)

Solve by converting to an integer instead of a float

import datetime
import time

def timeConverter(timestamp):
    time_tuple = datetime.datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S.000").timetuple()
    timevalue = int(time.mktime(time_tuple))  # convert to int here
    return timevalue

time_udf = fn.udf(timeConverter, IntegerType())  # output integer

df.select(time_udf(fn.col('timestamp')).alias('unix_timestamp')).show()

Here we get the same timestamps [1496259048, 1496259105, 1496259169] as with unix_timestamp.
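
Plugged back into the asker's convertDateColumn, a minimal sketch of the fix (integer output instead of FloatType) could look like this:

def convertDateColumn(Data):
    # IntegerType stores epoch seconds exactly, unlike single-precision FloatType
    time_udf = fn.udf(timeConverter, IntegerType())
    return Data.withColumn('adj_timestamp', time_udf('adj_timestamp'))

Note that IntegerType is a signed 32-bit int, so it overflows for epochs past January 2038; LongType is the safer choice if that matters.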
