Working with dates in pandas - remove unseen characters in datetime and convert to string

Question

I am using pandas to import data: dfST = read_csv( ... , parse_dates={'timestamp':[date]}). In my CSV, the date is in the format YYYY/MM/DD, which is all I need; there is no time component. I have several data sets that I need to compare for membership. When I convert these 'timestamp' values to strings, I sometimes get something like this:

'1977-07-31T00:00:00.000000000Z'

which I understand is a datetime including milliseconds and a timezone. Is there any way to suppress the addition of the extraneous time on import? If not, I need to exclude it somehow.

dfST.timestamp[1]
Out[138]: Timestamp('1977-07-31 00:00:00')

I have tried formatting it, which seemed to work until I inspected the formatted values:

dfSTdate=pd.to_datetime(dfST.timestamp, format="%Y-%m-%d")  
dfSTdate.head()
Out[123]: 
0   1977-07-31
1   1977-07-31
Name: timestamp, dtype: datetime64[ns]

But no... when I check a single value, I also get the time:

dfSTdate[1]
Out[124]: Timestamp('1977-07-31 00:00:00')

When I convert this to an array, the time is included along with the milliseconds and the timezone, which really messes up my comparisons.

test97=np.array(dfSTdate)
test97[1]
Out[136]: numpy.datetime64('1977-07-30T20:00:00.000000000-0400')

How can I get rid of the time? Ultimately I wish to compare membership among data sets using numpy.in1d, with the date as a string ('YYYY-MM-DD') as one part of the comparison.

Recommended answer

This is due to the way datetime values are stored in pandas: using the numpy datetime64[ns] dtype. So datetime values are always stored at nanosecond resolution. Even if you only have a date, it will be converted to a nanosecond-resolution timestamp with the time set to zero (midnight). This is just due to the implementation in pandas.
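
To see this storage behavior concretely, here is a minimal sketch. It assumes a small in-memory CSV with hypothetical column names (date, value) standing in for the real file, and it parses the column explicitly with pd.to_datetime rather than through parse_dates. The exact unit of the resulting dtype may depend on the pandas version, so the sketch only checks that the column is a datetime dtype and that every value sits at midnight.

```python
import io

import pandas as pd

# Hypothetical stand-in for the real CSV: dates only, no time component.
csv_text = "date,value\n1977/07/31,1\n1977/08/01,2\n"

df = pd.read_csv(io.StringIO(csv_text))
df["timestamp"] = pd.to_datetime(df["date"], format="%Y/%m/%d")

# The column is a datetime dtype even though the input held dates only.
print(df["timestamp"].dtype)
is_datetime = pd.api.types.is_datetime64_any_dtype(df["timestamp"])

# A single element is a pandas Timestamp whose time part is midnight.
first = df["timestamp"].iloc[0]
print(repr(first))

# Every stored value equals itself truncated to midnight,
# i.e. no non-zero time was added.
all_midnight = bool((df["timestamp"] == df["timestamp"].dt.normalize()).all())
```

The time shown in the repr is therefore not extra data sneaking in on import; it is the zero time that a date-only value receives when stored as a timestamp.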

The issues you have with printing the values and getting unexpected results are just because of how these objects are printed in the Python console (their representation), not because of their actual values.
If you print a single value, you get the pandas Timestamp representation:

Timestamp('1977-07-31 00:00:00')

So you get the seconds here as well, just because this is the default representation.
If you convert it to an array and then print it, you get the standard numpy representation:

numpy.datetime64('1977-07-30T20:00:00.000000000-0400')

This is indeed a very misleading representation, because numpy will, just for printing it in the console, convert it to your local timezone. But this doesn't change your actual value; it's just weird printing.
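
A short sketch can confirm that only the printing is affected. The sample dates are made up for illustration, and the way a datetime64 value is printed may differ between NumPy versions (as far as I know, recent ones no longer show a local-time offset). The comparison and the day-level strings below do not depend on the repr.

```python
import numpy as np
import pandas as pd

# Made-up sample data: two date-only values.
dates = pd.Series(pd.to_datetime(["1977-07-31", "1977-08-01"]))

# Convert to a plain numpy datetime64 array, as in the question.
arr = dates.to_numpy()

# The stored instant is still midnight on 1977-07-31,
# however the console chooses to display it.
same_instant = bool(arr[0] == np.datetime64("1977-07-31T00:00:00"))

# Render the values at day precision, which drops the time part.
day_strings = np.datetime_as_string(arr, unit="D")
print(day_strings)
```

Here np.datetime_as_string with unit="D" is one way to get 'YYYY-MM-DD' strings straight from the array, without going through the console representation.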

That is the background. Now to answer your question: how do I get rid of the time?
That depends on your goal. Do you really want to convert it to a string? Or do you just not like the repr?

  • If you just want to work with the datetime values, you don't need to get rid of it.

  • If you want to convert it to strings, you can apply strftime (df['timestamp'].apply(lambda x: x.strftime('%Y-%m-%d'))). Or, if the goal is to write it as strings to CSV, use the date_format keyword in to_csv.

  • If you really want a 'date', you can use the datetime.date type (standard Python type) in a DataFrame column. You can convert your existing column to this with pd.DatetimeIndex(dfST['timestamp']).date. But personally I don't think this has many advantages.
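
The options above can be sketched as follows. The two small frames and their values are hypothetical, Series.dt.strftime is used as a vectorized equivalent of the apply/strftime call, and np.isin stands in for the numpy.in1d mentioned in the question (an assumption on my part that the newer function is acceptable for the membership test).

```python
import numpy as np
import pandas as pd

# Hypothetical data sets to compare for membership.
df_a = pd.DataFrame(
    {"timestamp": pd.to_datetime(["1977-07-31", "1977-08-01", "1977-08-02"])}
)
df_b = pd.DataFrame({"timestamp": pd.to_datetime(["1977-08-01", "1977-08-03"])})

# Option: convert to 'YYYY-MM-DD' strings.
a_str = df_a["timestamp"].dt.strftime("%Y-%m-%d")
b_str = df_b["timestamp"].dt.strftime("%Y-%m-%d")

# Membership test on the strings: which dates of A also occur in B?
mask = np.isin(a_str.to_numpy(), b_str.to_numpy())

# Option: write date-only strings when exporting to CSV.
csv_text = df_a.to_csv(index=False, date_format="%Y-%m-%d")

# Option: an array of standard datetime.date objects.
a_dates = pd.DatetimeIndex(df_a["timestamp"]).date

print(mask)
print(csv_text)
print(a_dates[0])
```

Since the underlying datetime values are already equal when the dates match, the same membership test should also work on the datetime column directly; the string conversion is only needed if the other parts of the comparison require strings.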
