Converting pandas columns to datetime64 including missing values

Question

I'm using pandas to work with some time-series data that contains dates, numbers, categories, etc.

The problem I'm having is getting pandas to handle my date/time columns correctly in a DataFrame created from a CSV. There are 18 date columns in my data; they are not contiguous, and unknown values in the raw CSV have the string value "Unknown". Some columns have a valid datetime in every cell, and the pandas read_csv method correctly guesses their dtype. However, in a particular data sample some columns have "Unknown" in every cell, and these get typed as object.

My code to load the CSV is as follows:

self.datecols = ['Claim Date', 'Lock Date', 'Closed Date', 'Service Date', 'Latest_Submission', 'Statement Date 1', 'Statement Date 2', 'Statement Date 3', 'Patient Payment Date 1', 'Patient Payment Date 2', 'Patient Payment Date 3', 'Primary 1 Payment Date', 'Primary 2 Payment Date', 'Primary 3 Payment Date', 'Secondary 1 Payment Date', 'Secondary 2 Payment Date', 'Tertiary Payment Date']
self.csvbear = pd.read_csv(file_path, index_col="Claim ID", parse_dates=True, na_values=['Unknown'])
self.csvbear = pd.DataFrame.convert_objects(self.csvbear, convert_dates='coerce')
print self.csvbear.dtypes
print self.csvbear['Tertiary Payment Date'].values

The output from print self.csvbear.dtypes:

Prac                            object
Doctor Name                     object
Practice Name                   object
Specialty                       object
Speciality Code                  int64
Claim Date              datetime64[ns]
Lock Date               datetime64[ns]
Progress Note Locked            object
Aging by Claim Date              int64
Aging by Lock Date               int64
Closed Date             datetime64[ns]
Service Date            datetime64[ns]
Week Number                      int64
Month                   datetime64[ns]
Current Insurance               object
...
Secondary 2 Deductible        float64
Secondary 2 Co Insurance      float64
Secondary 2 Member Balance    float64
Secondary 2 Paid              float64
Secondary 2 Witheld           float64
Secondary 2 Ins                object
Tertiary Payment Date          object
Tertiary Payment ID           float64
Tertiary Allowed              float64
Tertiary Deductible           float64
Tertiary Co Insurance         float64
Tertiary Member Balance       float64
Tertiary Paid                 float64
Tertiary Witheld              float64
Tertiary Ins                  float64
Length: 96, dtype: object
[nan nan nan ..., nan nan nan]
Press any key to continue . . .

As you can see, the Tertiary Payment Date column should have a datetime64 dtype, but it is simply an object, and its actual content is just NaN (put there by the read_csv function in place of the string 'Unknown').

How can I reliably convert all of the date columns to have datetime64 as a dtype and NaT for 'Unknown' cells?

Answer

If you have an all-NaN column, read_csv won't coerce it properly. The easiest fix is to do this (a column that is already datetime64[ns] will just pass through):

In [3]: df = DataFrame(dict(A = Timestamp('20130101'), B = np.random.randn(5), C = np.nan))

In [4]: df
Out[4]: 
                    A         B   C
0 2013-01-01 00:00:00 -0.859994 NaN
1 2013-01-01 00:00:00 -2.562136 NaN
2 2013-01-01 00:00:00  0.410673 NaN
3 2013-01-01 00:00:00  0.480578 NaN
4 2013-01-01 00:00:00  0.464771 NaN

[5 rows x 3 columns]

In [5]: df.dtypes
Out[5]: 
A    datetime64[ns]
B           float64
C           float64
dtype: object

In [6]: df['A'] = pd.to_datetime(df['A'])

In [7]: df['C'] = pd.to_datetime(df['C'])

In [8]: df
Out[8]: 
                    A         B   C
0 2013-01-01 00:00:00 -0.859994 NaT
1 2013-01-01 00:00:00 -2.562136 NaT
2 2013-01-01 00:00:00  0.410673 NaT
3 2013-01-01 00:00:00  0.480578 NaT
4 2013-01-01 00:00:00  0.464771 NaT

[5 rows x 3 columns]

In [9]: df.dtypes
Out[9]: 
A    datetime64[ns]
B           float64
C    datetime64[ns]
dtype: object

convert_objects won't forcibly convert a column to datetime unless it has at least one non-NaN value that is a date (that's why your example fails). to_datetime can be more aggressive because it 'knows' that you really want to convert it.
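
Applied to the situation in the question, the same idea might look like the sketch below. It is a minimal illustration, not the asker's actual code. The inline CSV text, the shortened date_cols list and the variable names are made up for the example, and errors="coerce" is an extra assumption so that any unparseable string also becomes NaT.

```python
import io

import pandas as pd

# Hypothetical miniature of the CSV from the question: one date column with
# real dates and one in which every cell is the string "Unknown".
csv_text = """Claim ID,Claim Date,Tertiary Payment Date,Tertiary Paid
1,2013-01-01,Unknown,10.5
2,2013-01-02,Unknown,Unknown
3,Unknown,Unknown,3.0
"""

# Assumed subset of the date columns; the real list would name all of them.
date_cols = ["Claim Date", "Tertiary Payment Date"]

# na_values turns every "Unknown" cell into NaN while the file is read.
df = pd.read_csv(io.StringIO(csv_text), index_col="Claim ID", na_values=["Unknown"])

# Convert each date column explicitly. Columns that are already datetime
# pass through unchanged, and all-NaN columns become all-NaT.
for col in date_cols:
    df[col] = pd.to_datetime(df[col], errors="coerce")

print(df.dtypes)
```

Looping over an explicit list of column names keeps the conversion independent of what read_csv happened to infer for any particular data sample.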
