Converting pandas columns to datetime64 including missing values
Question
I'm using pandas to work with some time-series data that contains dates, numbers, categories, etc.
The problem I'm having is getting pandas to handle the date/time columns correctly in a DataFrame created from a CSV. There are 18 date columns in my data; they are not continuous, and unknown values in the raw CSV have the string value "Unknown". Some columns have a valid datetime in every cell, and read_csv correctly guesses their dtype. However, in a particular data sample some columns have "Unknown" in every cell, and these get typed as object.
My code to load the CSV is as follows:
self.datecols = ['Claim Date', 'Lock Date', 'Closed Date', 'Service Date', 'Latest_Submission', 'Statement Date 1', 'Statement Date 2', 'Statement Date 3', 'Patient Payment Date 1', 'Patient Payment Date 2', 'Patient Payment Date 3', 'Primary 1 Payment Date', 'Primary 2 Payment Date', 'Primary 3 Payment Date', 'Secondary 1 Payment Date', 'Secondary 2 Payment Date', 'Tertiary Payment Date']
self.csvbear = pd.read_csv(file_path, index_col="Claim ID", parse_dates=True, na_values=['Unknown'])
self.csvbear = pd.DataFrame.convert_objects(self.csvbear, convert_dates='coerce')
print self.csvbear.dtypes
print self.csvbear['Tertiary Payment Date'].values
The output from print self.csvbear.dtypes and print self.csvbear['Tertiary Payment Date'].values:
Prac object
Doctor Name object
Practice Name object
Specialty object
Speciality Code int64
Claim Date datetime64[ns]
Lock Date datetime64[ns]
Progress Note Locked object
Aging by Claim Date int64
Aging by Lock Date int64
Closed Date datetime64[ns]
Service Date datetime64[ns]
Week Number int64
Month datetime64[ns]
Current Insurance object
...
Secondary 2 Deductible float64
Secondary 2 Co Insurance float64
Secondary 2 Member Balance float64
Secondary 2 Paid float64
Secondary 2 Witheld float64
Secondary 2 Ins object
Tertiary Payment Date object
Tertiary Payment ID float64
Tertiary Allowed float64
Tertiary Deductible float64
Tertiary Co Insurance float64
Tertiary Member Balance float64
Tertiary Paid float64
Tertiary Witheld float64
Tertiary Ins float64
Length: 96, dtype: object
[nan nan nan ..., nan nan nan]
As you can see, the Tertiary Payment Date column should have a datetime64 dtype, but it is simply object, and its actual content is just NaN (put there by read_csv in place of the string 'Unknown').
How can I reliably convert all of the date columns so that they have datetime64 as their dtype and NaT for the 'Unknown' cells?
Recommended answer
If you have an all-NaN column, read_csv won't coerce it properly. The easiest fix is to call pd.to_datetime on the column (a column that is already datetime64[ns] will just pass through unchanged):
In [3]: df = pd.DataFrame(dict(A=pd.Timestamp('20130101'), B=np.random.randn(5), C=np.nan))
In [4]: df
Out[4]:
A B C
0 2013-01-01 00:00:00 -0.859994 NaN
1 2013-01-01 00:00:00 -2.562136 NaN
2 2013-01-01 00:00:00 0.410673 NaN
3 2013-01-01 00:00:00 0.480578 NaN
4 2013-01-01 00:00:00 0.464771 NaN
[5 rows x 3 columns]
In [5]: df.dtypes
Out[5]:
A datetime64[ns]
B float64
C float64
dtype: object
In [6]: df['A'] = pd.to_datetime(df['A'])
In [7]: df['C'] = pd.to_datetime(df['C'])
In [8]: df
Out[8]:
A B C
0 2013-01-01 00:00:00 -0.859994 NaT
1 2013-01-01 00:00:00 -2.562136 NaT
2 2013-01-01 00:00:00 0.410673 NaT
3 2013-01-01 00:00:00 0.480578 NaT
4 2013-01-01 00:00:00 0.464771 NaT
[5 rows x 3 columns]
In [9]: df.dtypes
Out[9]:
A datetime64[ns]
B float64
C datetime64[ns]
dtype: object
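The same idea can be applied to every date column in a frame loaded from CSV. The following is only a minimal sketch: the inline CSV text is a made-up miniature of the data in the question, and date_cols is a hypothetical stand-in for the full list of date column names.

```python
import io

import pandas as pd

# Hypothetical miniature of the CSV from the question (invented rows).
csv_text = """Claim ID,Claim Date,Tertiary Payment Date,Tertiary Paid
1,2013-01-01,Unknown,10.0
2,2013-01-02,Unknown,20.0
3,Unknown,Unknown,30.0
"""

# Assumed list of date columns; in the question this would be the full list.
date_cols = ["Claim Date", "Tertiary Payment Date"]

# 'Unknown' cells become NaN on load, so an all-'Unknown' column is all NaN.
df = pd.read_csv(io.StringIO(csv_text), index_col="Claim ID", na_values=["Unknown"])

# Convert each date column explicitly; NaN cells become NaT.
for col in date_cols:
    df[col] = pd.to_datetime(df[col])

print(df.dtypes)
```

After the loop, both columns should report a datetime64 dtype, including the one that contained nothing but 'Unknown'.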
convert_objects won't forcibly convert a column to datetime unless it has at least one non-NaN value that is a date (that's why your example fails). to_datetime can be more aggressive because it 'knows' that you really want to convert it.
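As a rough illustration of that explicit conversion, the sketch below uses two invented Series (they are not from the question's data): an object-typed column holding only missing values, and a column where the string 'Unknown' was left in place. Passing errors="coerce" is an optional extra not used in the answer above; it turns unparseable strings into NaT. Note also that convert_objects has been removed from current pandas releases, so to_datetime is the route that still works there.

```python
import numpy as np
import pandas as pd

# Invented example: an object column with nothing but missing values.
all_missing = pd.Series([np.nan, np.nan, np.nan], dtype=object)

# Invented example: 'Unknown' was not replaced by NaN when loading.
raw = pd.Series(["2013-01-01", "Unknown", "2013-03-15"])

# to_datetime converts the all-missing column to datetime with NaT values.
converted = pd.to_datetime(all_missing)

# errors="coerce" maps strings that cannot be parsed to NaT instead of raising.
parsed = pd.to_datetime(raw, errors="coerce")

print(converted.dtype)
print(parsed)
```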