dask dataframe how to convert column to to_datetime

Question

I am trying to convert one column of my dataframe to datetime. Following the discussion at https://github.com/dask/dask/issues/863, I tried the following code:

import dask.dataframe as dd
import pandas as pd

df['time'].map_partitions(pd.to_datetime, columns='time').compute()

But I am getting the following error message

ValueError: Metadata inference failed, please provide `meta` keyword

What exactly should I put under meta? Should I supply a dictionary of ALL the columns in df, or only the 'time' column? And what type should I use? I have tried dtype and datetime64, but neither has worked so far.

Thanks in advance for your guidance,

Update

I will include here the new error messages:

1) Using Timestamp

df['trd_exctn_dt'].map_partitions(pd.Timestamp).compute()

TypeError: Cannot convert input to Timestamp

2) Using datetime and meta

meta = ('time', pd.Timestamp)
df['time'].map_partitions(pd.to_datetime, meta=meta).compute()

TypeError: to_datetime() got an unexpected keyword argument 'meta'

3) Just using to_datetime: gets stuck at 2%

df['trd_exctn_dt'].map_partitions(pd.to_datetime).compute()
[                                        ] | 2% Completed |  2min 20.3s

Also, I would like to be able to specify the date format, as I would in pandas:

pd.to_datetime(df['time'], format='%m%d%Y')
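As a sanity check, the format argument behaves in plain pandas like so; in dask, extra keyword arguments passed to map_partitions are forwarded to the wrapped function, so the same format= (plus a meta, sketched here with an assumed 'M8[ns]' dtype and hypothetical column name) can go through map_partitions:

```python
import pandas as pd

# Plain-pandas demonstration of the format= argument. In dask the same
# keyword can be forwarded through map_partitions, e.g. (hypothetical):
#   df['time'].map_partitions(pd.to_datetime, format='%m%d%Y',
#                             meta=('time', 'M8[ns]'))
s = pd.Series(['01152016', '12312015'], name='time')
parsed = pd.to_datetime(s, format='%m%d%Y')
print(parsed.dt.year.tolist())  # [2016, 2015]
```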

Update 2

After updating to Dask 0.11, I no longer have problems with the meta keyword. Still, I can't get it past 2% on a 2GB dataframe.

df['trd_exctn_dt'].map_partitions(pd.to_datetime, meta=meta).compute()
[                                        ] | 2% Completed |  30min 45.7s

Update 3

This works better:

def parse_dates(df):
    return pd.to_datetime(df['time'], format='%m/%d/%Y')

df.map_partitions(parse_dates, meta=meta)
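Each dask partition is a plain pandas DataFrame, so the per-partition function above can be exercised directly in pandas before handing it to map_partitions; the sample data and the meta tuple mentioned in the comment are assumptions for illustration:

```python
import pandas as pd

# The per-partition function, applied to a plain pandas DataFrame — which
# is exactly what each dask partition is. In dask this would run as
# df.map_partitions(parse_dates, meta=('time', 'M8[ns]')), where the meta
# tuple is an assumption about the desired output name and dtype.
def parse_dates(df):
    return pd.to_datetime(df['time'], format='%m/%d/%Y')

pdf = pd.DataFrame({'time': ['01/15/2016', '12/31/2015']})
out = parse_dates(pdf)
print(out.dt.year.tolist())  # [2016, 2015]
```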

I'm not sure whether this is the right approach, though.

Answer

Use astype

You can use the astype method to convert the dtype of a series to a NumPy dtype:

df.time.astype('M8[us]')
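For reference, the same astype call behaves like this on a plain pandas Series (dask applies it lazily per partition); this sketch uses 'M8[ns]', the nanosecond variant of the answer's microsecond 'M8[us]', since unit support varies across pandas versions:

```python
import pandas as pd

# astype with a NumPy datetime dtype string parses the strings into a
# datetime64 series; dask's df.time.astype(...) does the same per partition.
s = pd.Series(['2016-01-15', '2015-12-31'])
converted = s.astype('M8[ns]')
print(converted.dtype.kind)  # M (a NumPy datetime64 dtype)
```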

There is probably a way to specify a Pandas-style dtype as well (edits welcome).

When using black-box methods like map_partitions, dask.dataframe needs to know the types and names of the output. There are a few ways to do this, listed in the docstring for map_partitions.

You can supply an empty Pandas object with the right dtype and name:

meta = pd.Series([], name='time', dtype=pd.Timestamp)
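As a side note, recent pandas versions may not accept the pd.Timestamp class itself as a Series dtype; a concrete datetime dtype string is a safer way to build the same empty prototype (a sketch, assuming nanosecond resolution is acceptable):

```python
import pandas as pd

# The empty-Series prototype with an explicit datetime dtype string.
# Dask only inspects its name and dtype; it contains no data.
meta = pd.Series([], name='time', dtype='M8[ns]')
print(meta.name, meta.dtype)
```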

Or you can provide a tuple of (name, dtype) for a Series, or a dict for a DataFrame:

meta = ('time', pd.Timestamp)

Then everything should be fine:

df.time.map_partitions(pd.to_datetime, meta=meta)

If you were calling map_partitions on df instead, then you would need to provide dtypes for everything. That isn't the case in your example, though.
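If you did need the DataFrame form, a dict of column dtypes can be sketched from the frame's existing dtypes, overriding only the converted column (the column names here are hypothetical):

```python
import pandas as pd

# Hedged sketch of DataFrame-level meta: a name -> dtype mapping for every
# output column, built from the existing dtypes with the converted column
# overridden.
pdf = pd.DataFrame({'time': ['01/15/2016'], 'price': [1.5]})
meta = dict(pdf.dtypes)   # name -> dtype for every column
meta['time'] = 'M8[ns]'   # the column being converted
print(meta)
```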
