Pandas datetime formatting

Question

Is it possible to display the result of pd.to_datetime with trailing zeros? The zeros seem to be dropped.

import pandas as pd

print(pd.to_datetime("2000-07-26 14:21:00.00000",
                     format="%Y-%m-%d %H:%M:%S.%f"))

The result is

2000-07-26 14:21:00

The desired result would be

2000-07-26 14:21:00.00000

I know the two values mean the same thing, but it would be nice for consistency.

Recommended answer

Some testing shows that when date-time strings are parsed with format="%H:%M:%S.%f", %f can carry nanosecond resolution: all nine digits are kept when the input has a non-zero digit as far out as the ninth decimal place. When the result is displayed, between zero and five trailing zeros are appended, depending on the position of the least significant non-zero digit after the decimal point (assuming it is also the final digit of the input). The table below summarizes the test data: position is the decimal place of that final non-zero digit, and zeros is the number of trailing zeros added in the output:

    position zeros
       9      0
       8      1
       7      2
       6      0
       5      1
       4      2
       3      3
       2      4
       1      5
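
The pattern in the table can be reproduced with a small sketch. The helper name shown_fraction is hypothetical, and the comments describe the behavior seen in the transcripts in this answer rather than a documented guarantee:

```python
import pandas as pd

def shown_fraction(text):
    """Return the fractional digits pandas displays for a parsed time string."""
    ts = pd.to_datetime(text, format="%H:%M:%S.%f")
    shown = str(ts)
    # No decimal point is printed when the fraction is exactly zero.
    return shown.split(".")[1] if "." in shown else ""

for raw in ["10:00:00.000000001", "10:00:00.0000001", "10:00:00.001", "10:00:00.0"]:
    print(raw, "->", repr(shown_fraction(raw)))
```

In these tests the display uses nine digits when any digit beyond the sixth place is non-zero, six digits when only the microsecond part is non-zero, and no fraction otherwise, which accounts for the zero counts in the table.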

When a whole column is parsed with "%H:%M:%S.%f", all of its elements are displayed with the same number of digits after the decimal point. Trailing zeros are added or removed to achieve this, even if that makes the displayed resolution higher or lower than that of the raw data. I guess the reasons are consistency and aesthetics, without usually introducing excessive error: in numeric calculations trailing zeros do not typically affect the immediate result, although they can affect the estimate of its error and how it should be presented (Trailing Zeros, Rules for Significant Figures).
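
Because the stored value is the same however many zeros are displayed, one way to get a fixed number of fractional digits is to format the value explicitly as text. The following is a minimal sketch using Timestamp.strftime; the variable names are illustrative only:

```python
import pandas as pd

ts = pd.to_datetime("2000-07-26 14:21:00.00000", format="%Y-%m-%d %H:%M:%S.%f")

# %f in strftime always writes six microsecond digits, zeros included.
six_digits = ts.strftime("%Y-%m-%d %H:%M:%S.%f")

# The question asked for five digits; dropping the last character truncates to that width.
five_digits = six_digits[:-1]

print(six_digits)
print(five_digits)
```

Note that the result is a plain string, not a Timestamp, and that %f covers microseconds only, so any nanosecond digits would not appear in it.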

Below are some observations from applying the "%H:%M:%S.%f" format to individual strings and to a pandas.Series (DataFrame column) with pandas.to_datetime, and from applying pandas.DataFrame.convert_objects(convert_dates='coerce') to DataFrames with a column that can be converted to datetime.

Given a single string, pandas preserves a non-zero digit as far out as the ninth decimal place when converting with "%H:%M:%S.%f", and adds the date 1900-01-01 if none was provided:

import pandas as pd
pd.to_datetime ("10:00:00.000000001",format="%H:%M:%S.%f")
Out[15]: Timestamp('1900-01-01 10:00:00.000000001')

pd.to_datetime ("2015-09-17 10:00:00.000000001",format="%Y-%m-%d %H:%M:%S.%f")
Out[15]: Timestamp('2015-09-17 10:00:00.000000001')

For shorter inputs, in tests where the final non-zero digit is also the final digit, the output gains up to five trailing zeros after that digit, which overstates the resolution of the raw data. The exception is a final non-zero digit in the sixth decimal place, which is shown as is:

pd.to_datetime ("10:00:00.00000001",format="%H:%M:%S.%f")
Out[15]: Timestamp('1900-01-01 10:00:00.000000010')

pd.to_datetime ("2015-09-17 10:00:00.00000001",format="%Y-%m-%d %H:%M:%S.%f")
Out[16]: Timestamp('2015-09-17 10:00:00.000000010')

pd.to_datetime ("10:00:00.0000001",format="%H:%M:%S.%f")
Out[15]: Timestamp('1900-01-01 10:00:00.000000100')

pd.to_datetime ("2015-09-17 10:00:00.0000001",format="%Y-%m-%d %H:%M:%S.%f")
Out[17]: Timestamp('2015-09-17 10:00:00.000000100')

pd.to_datetime ("10:00:00.000001",format="%H:%M:%S.%f")
Out[33]: Timestamp('1900-01-01 10:00:00.000001')

pd.to_datetime ("2015-09-17 10:00:00.000001",format="%Y-%m-%d %H:%M:%S.%f")
Out[18]: Timestamp('2015-09-17 10:00:00.000001')

pd.to_datetime ("10:00:00.00001",format="%H:%M:%S.%f")
Out[6]: Timestamp('1900-01-01 10:00:00.000010')

pd.to_datetime ("2015-09-17 10:00:00.00001",format="%Y-%m-%d %H:%M:%S.%f")
Out[19]: Timestamp('2015-09-17 10:00:00.000010')

pd.to_datetime ("10:00:00.0001",format="%H:%M:%S.%f")
Out[9]: Timestamp('1900-01-01 10:00:00.000100')

pd.to_datetime ("2015-09-17 10:00:00.0001",format="%Y-%m-%d %H:%M:%S.%f")
Out[21]: Timestamp('2015-09-17 10:00:00.000100')

pd.to_datetime ("10:00:00.001",format="%H:%M:%S.%f")
Out[10]: Timestamp('1900-01-01 10:00:00.001000')

pd.to_datetime ("2015-09-17 10:00:00.001",format="%Y-%m-%d %H:%M:%S.%f")
Out[22]: Timestamp('2015-09-17 10:00:00.001000')

pd.to_datetime ("10:00:00.01",format="%H:%M:%S.%f")
Out[12]: Timestamp('1900-01-01 10:00:00.010000')

pd.to_datetime ("2015-09-17 10:00:00.01",format="%Y-%m-%d %H:%M:%S.%f")
Out[24]: Timestamp('2015-09-17 10:00:00.010000')

pd.to_datetime ("10:00:00.1",format="%H:%M:%S.%f")
Out[13]: Timestamp('1900-01-01 10:00:00.100000')

pd.to_datetime ("2015-09-17 10:00:00.1",format="%Y-%m-%d %H:%M:%S.%f")
Out[26]: Timestamp('2015-09-17 10:00:00.100000')
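
The padding in these outputs affects only how the value is printed. As a quick check (a sketch with illustrative variable names), two inputs that differ only in trailing zeros parse to equal Timestamps:

```python
import pandas as pd

fmt = "%H:%M:%S.%f"
short = pd.to_datetime("10:00:00.1", format=fmt)
padded = pd.to_datetime("10:00:00.100000", format=fmt)

# Both strings describe one tenth of a second past 10:00:00.
print(short == padded)
print(short.microsecond, short.nanosecond)
```

The microsecond and nanosecond attributes expose the stored components directly, independent of how many digits the repr chooses to show.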

Let's see how it works with a DataFrame:

!type test.csv # here type is Windows substitute for Linux cat command
date,mesg
10:00:00.000000001,one
10:00:00.00000001,two
10:00:00.0000001,three
10:00:00.000001,four
10:00:00.00001,five
10:00:00.0001,six
10:00:00.001,seven
10:00:00.01,eight
10:00:00.1,nine
10:00:00.000000001,ten
10:00:00.000000002,eleven
10:00:00.000000003,twelve

df = pd.read_csv('test.csv')
df
Out[30]: 
                  date    mesg
0   10:00:00.000000001     one
1    10:00:00.00000001     two
2     10:00:00.0000001   three
3      10:00:00.000001    four
4       10:00:00.00001    five
5        10:00:00.0001     six
6         10:00:00.001   seven
7          10:00:00.01   eight
8           10:00:00.1    nine
9   10:00:00.000000001     ten
10  10:00:00.000000002  eleven
11  10:00:00.000000003  twelve

df.dtypes
Out[31]: 
date    object
mesg    object
dtype: object

Datetime conversion of a DataFrame with convert_objects, which has no format option, yields microsecond resolution even when some of the raw data is coarser or finer than that (digits beyond the sixth decimal place are dropped), and adds today's date:

df2 = df.convert_objects(convert_dates='coerce')
df2
Out[32]: 
                     date    mesg
0  2015-09-17 10:00:00.000000     one
1  2015-09-17 10:00:00.000000     two
2  2015-09-17 10:00:00.000000   three
3  2015-09-17 10:00:00.000001    four
4  2015-09-17 10:00:00.000010    five
5  2015-09-17 10:00:00.000100     six
6  2015-09-17 10:00:00.001000   seven
7  2015-09-17 10:00:00.010000   eight
8  2015-09-17 10:00:00.100000    nine
9  2015-09-17 10:00:00.000000     ten
10 2015-09-17 10:00:00.000000  eleven
11 2015-09-17 10:00:00.000000  twelve

df2.dtypes
Out[33]: 
date    datetime64[ns]
mesg            object
dtype: object

Once a column has been converted without an explicit format specifier (that is, with DataFrame.convert_objects), the finer-than-microsecond resolution of the raw data cannot be recovered by applying "%H:%M:%S.%f" afterwards:

df2['date'] = pd.to_datetime(df2['date'],format="%H:%M:%S.%f")
df2
Out[34]: 
                         date    mesg
0  2015-09-17 10:00:00.000000     one
1  2015-09-17 10:00:00.000000     two
2  2015-09-17 10:00:00.000000   three
3  2015-09-17 10:00:00.000001    four
4  2015-09-17 10:00:00.000010    five
5  2015-09-17 10:00:00.000100     six
6  2015-09-17 10:00:00.001000   seven
7  2015-09-17 10:00:00.010000   eight
8  2015-09-17 10:00:00.100000    nine
9  2015-09-17 10:00:00.000000     ten
10 2015-09-17 10:00:00.000000  eleven
11 2015-09-17 10:00:00.000000  twelve

Parsing a DataFrame column with "%H:%M:%S.%f" directly from the raw strings gives nanosecond resolution if at least one element has a non-zero digit in the ninth decimal place (as advertised in the pandas.to_datetime documentation). It also pads raw data that is coarser than a nanosecond out to nine digits, and adds 1900-01-01 as the date:

df3 = df.copy(deep=True)
df3['date'] = pd.to_datetime(df3['date'],format="%H:%M:%S.%f",coerce=True)
df3
Out[35]:
                            date    mesg
0  1900-01-01 10:00:00.000000001     one
1  1900-01-01 10:00:00.000000010     two
2  1900-01-01 10:00:00.000000100   three
3  1900-01-01 10:00:00.000001000    four
4  1900-01-01 10:00:00.000010000    five
5  1900-01-01 10:00:00.000100000     six
6  1900-01-01 10:00:00.001000000   seven
7  1900-01-01 10:00:00.010000000   eight
8  1900-01-01 10:00:00.100000000    nine
9  1900-01-01 10:00:00.000000001     ten
10 1900-01-01 10:00:00.000000002  eleven
11 1900-01-01 10:00:00.000000003  twelve
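
If the 1900-01-01 placeholder date is unwanted, one possible approach is to prepend a date to the time strings before parsing. This is only a sketch: the date 2015-09-17 and the variable names are assumptions chosen for illustration.

```python
import pandas as pd

times = pd.Series(["10:00:00.000000001", "10:00:00.1"])

# Prepend a chosen date (hypothetical here) so the parser does not fall back to 1900-01-01.
stamped = pd.to_datetime("2015-09-17 " + times, format="%Y-%m-%d %H:%M:%S.%f")

print(stamped)
```

Because the explicit format is still used, the nanosecond digit of the first element survives the conversion.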

Parsing a DataFrame column with "%H:%M:%S.%f" pads the datum that has the least significant non-zero digit after the decimal point (taken over the whole column) with zeros according to the position/zeros table above, and then displays every other element at that same resolution, even if doing so overstates or understates the resolution of some raw data:

df4 = pd.read_csv('test2.csv')
df4
Out[36]: 
                  date    mesg
0   10:00:00.000000000     one
1    10:00:00.00000000     two
2     10:00:00.0000000   three
3      10:00:00.000000    four
4       10:00:00.00000    five
5        10:00:00.0001     six
6          10:00:00.00   seven
7           10:00:00.0   eight
8            10:00:00.    nine
9   10:00:00.000000000     ten
10  10:00:00.000000000  eleven
11   10:00:00.00000000  twelve

df4['date'] = pd.to_datetime(df4['date'],format="%H:%M:%S.%f",coerce=True)
df4
Out[37]: 
                         date    mesg
0  1900-01-01 10:00:00.000000     one
1  1900-01-01 10:00:00.000000     two
2  1900-01-01 10:00:00.000000   three
3  1900-01-01 10:00:00.000000    four
4  1900-01-01 10:00:00.000000    five
5  1900-01-01 10:00:00.000100     six
6  1900-01-01 10:00:00.000000   seven
7  1900-01-01 10:00:00.000000   eight
8                         NaT    nine # nothing after decimal point in raw data
9  1900-01-01 10:00:00.000000     ten
10 1900-01-01 10:00:00.000000  eleven
11 1900-01-01 10:00:00.000000  twelve

The same thing happened with the same DataFrame when dates were included in the date column:

df25
Out[38]: 
                             date    mesg
0   2015-09-10 10:00:00.000000000     one
1    2015-09-11 10:00:00.00000000     two
2     2015-09-12 10:00:00.0000000   three
3      2015-09-13 10:00:00.000000    four
4       2015-09-14 10:00:00.00000    five
5        2015-09-15 10:00:00.0001     six
6          2015-09-16 10:00:00.00   seven
7           2015-09-17 10:00:00.0   eight
8            2015-09-18 10:00:00.    nine
9   2015-09-19 10:00:00.000000000     ten
10  2015-09-20 10:00:00.000000000  eleven
11   2015-09-21 10:00:00.00000000  twelve

df25['date'] = pd.to_datetime(df25['date'],format="%Y-%m-%d %H:%M:%S.%f",coerce=True)
df25
Out[39]: 
                         date    mesg
0  2015-09-10 10:00:00.000000     one
1  2015-09-11 10:00:00.000000     two
2  2015-09-12 10:00:00.000000   three
3  2015-09-13 10:00:00.000000    four
4  2015-09-14 10:00:00.000000    five
5  2015-09-15 10:00:00.000100     six
6  2015-09-16 10:00:00.000000   seven
7  2015-09-17 10:00:00.000000   eight
8                         NaT    nine # nothing after decimal point in raw data
9  2015-09-19 10:00:00.000000     ten
10 2015-09-20 10:00:00.000000  eleven
11 2015-09-21 10:00:00.000000  twelve
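
The NaT rows above come from values with nothing after the decimal point. The sketch below shows one way to locate such values and one possible repair. It assumes a recent pandas, where the keyword is errors="coerce" rather than the coerce=True used in the transcripts; the sample data and variable names are illustrative.

```python
import pandas as pd

raw = pd.Series(["10:00:00.0001", "10:00:00.", "10:00:00.0"])
fmt = "%H:%M:%S.%f"

parsed = pd.to_datetime(raw, format=fmt, errors="coerce")
rejected = raw[parsed.isna()]  # raw strings that became NaT

# One possible repair: give a dangling decimal point a single zero digit.
repaired_raw = raw.where(~raw.str.endswith("."), raw + "0")
repaired = pd.to_datetime(repaired_raw, format=fmt, errors="coerce")

print(rejected.tolist())
print(repaired.isna().sum())
```

Whether a dangling decimal point should be read as a zero fraction depends on the data source, so the repair step is a judgment call rather than a general rule.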

When no raw datum has a non-zero digit after the decimal point, parsing a DataFrame column with "%H:%M:%S.%f" displays every element with no fractional part at all (only the two-digit seconds field remains), regardless of how many zeros each raw value carried:

df5 = pd.read_csv('test3.csv')
df5
Out[40]: 
                  date    mesg
0         10:00:00.000     one
1           10:00:00.0     two
2         10:00:00.000   three
3         10:00:00.000    four
4          10:00:00.00    five
5         10:00:00.000     six
6          10:00:00.00   seven
7           10:00:00.0   eight
8           10:00:00.0    nine
9   10:00:00.000000000     ten
10        10:00:00.000  eleven
11        10:00:00.000  twelve

df5['date'] = pd.to_datetime(df5['date'],format="%H:%M:%S.%f",coerce=True)
df5
Out[41]: 
                  date    mesg
0  1900-01-01 10:00:00     one
1  1900-01-01 10:00:00     two
2  1900-01-01 10:00:00   three
3  1900-01-01 10:00:00    four
4  1900-01-01 10:00:00    five
5  1900-01-01 10:00:00     six
6  1900-01-01 10:00:00   seven
7  1900-01-01 10:00:00   eight
8  1900-01-01 10:00:00    nine
9  1900-01-01 10:00:00     ten
10 1900-01-01 10:00:00  eleven
11 1900-01-01 10:00:00  twelve

The same thing happened when this test was repeated with dates included in the date column:

df45
Out[42]: 
                             date    mesg
0         2015-09-10 10:00:00.000     one
1           2015-09-11 10:00:00.0     two
2         2015-09-12 10:00:00.000   three
3         2015-09-13 10:00:00.000    four
4          2015-09-14 10:00:00.00    five
5         2015-09-15 10:00:00.000     six
6          2015-09-16 10:00:00.00   seven
7           2015-09-17 10:00:00.0   eight
8           2015-09-18 10:00:00.0    nine
9   2015-09-19 10:00:00.000000000     ten
10        2015-09-20 10:00:00.000  eleven
11        2015-09-21 10:00:00.000  twelve

df45['date'] = pd.to_datetime(df45['date'],format="%Y-%m-%d %H:%M:%S.%f",coerce=True)
df45
Out[43]: 
                  date    mesg
0  2015-09-10 10:00:00     one
1  2015-09-11 10:00:00     two
2  2015-09-12 10:00:00   three
3  2015-09-13 10:00:00    four
4  2015-09-14 10:00:00    five
5  2015-09-15 10:00:00     six
6  2015-09-16 10:00:00   seven
7  2015-09-17 10:00:00   eight
8  2015-09-18 10:00:00    nine
9  2015-09-19 10:00:00     ten
10 2015-09-20 10:00:00  eleven
11 2015-09-21 10:00:00  twelve
