pandas: Calculated column based on values in one column
Question
I have columns like this in a CSV file, which I load using read_csv('fileA.csv', parse_dates=['ProcessA_Timestamp']):
Item ProcessA_Timestamp
'A' 2014-06-08 03:32:20
'B' 2014-06-08 03:32:20
'A' 2014-06-08 03:33:19
'C' 2014-06-08 03:33:20
'B' 2014-06-08 03:33:40
'D' 2014-06-08 03:38:20
How would I go about creating a column called ProcessA_ProcessingTime, which would be the time difference between the last time an item occurs in the table and the first time it occurs in the table?
Similarly, I have other data frames (I'm not sure whether they should be merged into one dataframe) that have their own Process*_Timestamp columns.
Finally, I need to create a table, where the data is like this:
Item ProcessA_ProcessingTime ProcessB_ProcessingTime ... ProcessX_ProcessingTime
'A' 00:00:59 ...
'B' 00:01:20
'C' NOT FINISHED YET
'D' NOT FINISHED YET
Recommended answer
You can use the pandas groupby-apply combo. Group the dataframe by "Item" and apply a function that calculates the process time. Something like:
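import pandas as pd

def calc_process_time(group):
    # group is the sub-dataframe holding all rows for one item
    ts = group["ProcessA_Timestamp"].values
    if len(ts) == 1:
        # a single occurrence means the item has not finished yet
        return pd.NaT
    else:
        return ts[-1] - ts[0]  # last time - first time

# df is the dataframe loaded with read_csv in the question
df.groupby("Item").apply(calc_process_time)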
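
To get from the per-process result to the wide table asked for in the question, one possible approach is to compute each process's time as a named Series and join them on "Item". The sketch below is only an illustration under some assumptions: the helper name processing_time is made up, the ProcessB data is hypothetical (the question shows only ProcessA), and it uses max minus min per group instead of last minus first, which gives the same result only when "last" means the latest timestamp.

```python
import io
import pandas as pd

# Sample data from the question, inlined so the sketch is self-contained.
CSV_A = """Item,ProcessA_Timestamp
A,2014-06-08 03:32:20
B,2014-06-08 03:32:20
A,2014-06-08 03:33:19
C,2014-06-08 03:33:20
B,2014-06-08 03:33:40
D,2014-06-08 03:38:20
"""

# Hypothetical data for a second process (not part of the question).
CSV_B = """Item,ProcessB_Timestamp
A,2014-06-08 04:00:00
A,2014-06-08 04:02:30
B,2014-06-08 04:01:00
"""

def processing_time(df, ts_col):
    """Latest minus earliest timestamp per item; NaT if the item occurs once."""
    grouped = df.groupby("Item")[ts_col]
    span = grouped.max() - grouped.min()
    # Items seen only once are treated as not finished yet.
    span = span.where(grouped.count() > 1)
    return span.rename(ts_col.replace("_Timestamp", "_ProcessingTime"))

df_a = pd.read_csv(io.StringIO(CSV_A), parse_dates=["ProcessA_Timestamp"])
df_b = pd.read_csv(io.StringIO(CSV_B), parse_dates=["ProcessB_Timestamp"])

# Align the per-process Series on "Item"; items missing from a process get NaT.
result = pd.concat(
    [
        processing_time(df_a, "ProcessA_Timestamp"),
        processing_time(df_b, "ProcessB_Timestamp"),
    ],
    axis=1,
)
print(result)
```

Here NaT plays the role of "NOT FINISHED YET". Keeping the frames separate and joining only the per-item results avoids having to merge the raw timestamp tables into one dataframe first.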