How to check for wrong datetime entries (python/pandas)?


Question

I have an Excel dataset containing datetime values for worked hours, entered by employees. Now that the end of the year is near, they want to report on it; however, it is full of wrong entries, so I need to clean it.

Below are some examples of wrong entries.

What would be your approach when facing such datasets?

I first converted the date column to datetime using df['Shiftdatum'] = pd.to_datetime(df.Shiftdatum, format='%Y-%m-%d', errors='coerce').

In the sample data below, this produces a NaT.

How do I filter out these NaT values, including the row's index?

[Timestamp('2019-03-11 00:00:00'),
 Timestamp('2019-03-11 00:00:00'),
 Timestamp('2019-03-11 00:00:00'),
 Timestamp('2019-03-11 00:00:00'),
 Timestamp('2019-03-11 00:00:00'),
 Timestamp('2019-03-11 00:00:00'),
 Timestamp('2019-03-11 00:00:00'),
 Timestamp('2019-03-11 00:00:00'),
 Timestamp('2019-03-11 00:00:00'),
 Timestamp('2019-03-11 00:00:00'),
 Timestamp('2019-03-11 00:00:00'),
 Timestamp('2019-03-11 00:00:00'),
 Timestamp('2019-03-11 00:00:00'),
 Timestamp('2019-03-11 00:00:00'),
 Timestamp('2019-03-11 00:00:00'),
 Timestamp('2019-03-11 00:00:00'),
 NaT,
 Timestamp('2019-03-12 00:00:00')

Initial sample data:

{0: '2019-03-11 00:00:00',
 1: '2019-03-11 00:00:00',
 2: '2019-03-11 00:00:00',
 3: '2019-03-11 00:00:00',
 4: '2019-03-11 00:00:00',
 5: '2019-03-11 00:00:00',
 6: '2019-03-11 00:00:00',
 7: '2019-03-11 00:00:00',
 8: '2019-03-11 00:00:00',
 9: '2019-03-11 00:00:00',
 10: '2019-03-11 00:00:00',
 11: '2019-03-11 00:00:00',
 12: '2019-03-11 00:00:00',
 13: '2019-03-11 00:00:00',
 14: '2019-03-11 00:00:00',
 15: '2019-03-11 00:00:00',
 16: '33/11/2019',
 17: '2019-03-12 00:00:00',
 18: '2019-03-12 00:00:00',
 19: '2019-03-12 00:00:00'}

Recommended answer

IIUC,

You could handle this in a number of ways. One is to use pd.to_datetime(column, errors='coerce') and assign the result to a new column.

Then, with the new column, you can filter by NaT and get the unique outliers.

Let's say this is the result:

import pandas as pd

data = ['033-10-2019', '100-03-2019', '1003-03-2019', '03-10-2019']

df = pd.DataFrame({'date_time' : data})
df['correct'] = pd.to_datetime(df['date_time'],errors='coerce')
print(df)
       date_time    correct
0   033-10-2019        NaT
1   100-03-2019        NaT
2  1003-03-2019        NaT
3    03-10-2019 2019-03-10
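
Note that in the output above, '03-10-2019' was read month-first (10 March). If the entries are actually day-first, passing an explicit format makes the parser strict about the layout. A minimal sketch, assuming a DD-MM-YYYY layout (the original post does not say which layout the data uses):

```python
import pandas as pd

data = ['033-10-2019', '100-03-2019', '1003-03-2019', '03-10-2019']
df = pd.DataFrame({'date_time': data})

# Assumption: entries are meant to be day-month-year.
# Anything that does not match the format exactly becomes NaT.
df['correct'] = pd.to_datetime(df['date_time'], format='%d-%m-%Y', errors='coerce')
print(df)
```

With this format the three malformed strings still end up as NaT, while the last entry is read as 3 October rather than 10 March.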

Now we need to grab the unique values in the date_time column that were parsed as NaT:

errors = df.loc[df['correct'].isna(), 'date_time'].unique().tolist()
out : ['033-10-2019', '100-03-2019', '1003-03-2019']

This is the boring bit: you'll need to go through the errors, work out the correct value for each one, and put the corrections into a dictionary:

correct_dict = {'033-10-2019' : '03-10-2019', '100-03-2019' : '03-10-2019', '1003-03-2019' : '10-03-2019'}

Then map the values back into your dataframe:

df['correct'] = df['correct'].fillna(pd.to_datetime(df['date_time'].map(correct_dict)))
print(df)
      date_time    correct
0   033-10-2019 2019-03-10
1   100-03-2019 2019-03-10
2  1003-03-2019 2019-10-03
3    03-10-2019 2019-03-10
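
After mapping, it can be worth checking whether any bad entry was left out of the dictionary. The sketch below repeats the steps above with one extra, made-up entry ('not a date') that has no correction, and lists whatever is still NaT afterwards. The names fixed and still_bad are illustrative only.

```python
import pandas as pd

# Same data as above plus one hypothetical entry with no correction.
data = ['033-10-2019', '100-03-2019', '1003-03-2019', '03-10-2019', 'not a date']
df = pd.DataFrame({'date_time': data})
df['correct'] = pd.to_datetime(df['date_time'], errors='coerce')

correct_dict = {'033-10-2019': '03-10-2019',
                '100-03-2019': '03-10-2019',
                '1003-03-2019': '10-03-2019'}

# Entries missing from the dictionary map to NaN and stay NaT.
fixed = pd.to_datetime(df['date_time'].map(correct_dict), errors='coerce')
df['correct'] = df['correct'].fillna(fixed)

# Whatever is still NaT needs another look (or another dictionary entry).
still_bad = df.loc[df['correct'].isna(), 'date_time'].unique().tolist()
print(still_bad)
```

An empty list here means every unparseable value has been corrected.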

If you just want to remove the NaT values, you can call dropna with the subset argument set to that column:

df = df.dropna(subset=['correct'])
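
To tie this back to the sample data in the question, here is a rough sketch of locating the bad rows together with their index before dropping them. It assumes the column holds strings as shown in the initial sample data and that valid entries follow the layout YYYY-MM-DD HH:MM:SS; the names parsed, bad_mask and bad_rows are illustrative only.

```python
import pandas as pd

# Rebuild the question's sample data: 16 good rows, one bad row, 3 good rows.
values = (['2019-03-11 00:00:00'] * 16
          + ['33/11/2019']
          + ['2019-03-12 00:00:00'] * 3)
df = pd.DataFrame({'Shiftdatum': values})

# Assumption: valid entries look like '2019-03-11 00:00:00'.
parsed = pd.to_datetime(df['Shiftdatum'], format='%Y-%m-%d %H:%M:%S',
                        errors='coerce')

# Rows that had a value but failed to parse (as opposed to empty cells).
bad_mask = parsed.isna() & df['Shiftdatum'].notna()
bad_rows = df[bad_mask]

print(bad_rows)                  # original text of the bad entries
print(bad_rows.index.tolist())   # their row index labels

# Keep only the rows that parsed, storing the parsed values.
clean = df.assign(Shiftdatum=parsed).dropna(subset=['Shiftdatum'])
```

Comparing against the original column with notna() separates cells that were genuinely empty from cells that contained an unparseable value.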
