Iterating pandas dataframe, checking values and creating some of them

Question

Ok, I have a (big) dataframe, something like this:

         date       time      value
0     20100201         0         1
1     20100201         6         2
2     20100201        12         3
3     20100201        18         4
4     20100202         0         5
5     20100202         6         6
6     20100202        12         7
7     20100202        18         8
8     20100203         0         9
9     20100203        18        11
10    20100204         6        12
...
8845  20160101        18      8846  

As you can see, the dataframe has a date column, a time column with four hours for each day (00, 06, 12, 18), and a value column.

The problem is that some dates and times are missing from the dataframe. In the example above there should be two extra rows between rows 8 and 9, corresponding to hours 6 and 12 of day 20100203, and one extra row between rows 9 and 10, corresponding to hour 0 of day 20100204.

What do I need? I would like to iterate over the date column of the dataframe, checking that every day exists and none is missing, and also that every day has all four hours (00, 06, 12, 18). If something is missing, a row should be added in exactly that place, with the missing date and time and NaN as the value. To avoid copying the whole dataframe again, here are the relevant rows as they should appear in the final version:

...
7     20100202        18         8
8     20100203         0         9
9     20100203         6       NaN
10    20100203        12       NaN   
11    20100203        18        11
12    20100204         0       NaN
13    20100204         6        12
...

In case you are interested, an easier version of this problem was asked in "Modular arithmetic in python to iterate a pandas dataframe" and kindly answered by users @Alexander and @piRSquared. The version asked here is more difficult, involving (I suppose) the use of datetime and timedelta and iterating over more columns.

Sorry for the long post and thank you very much.

Recommended answer

You can use pivot for reshaping: any time that is missing for a date shows up as NaN in the corresponding time column. Then use unstack with reset_index and sort_values to get back to the long format:

import pandas as pd

df = pd.DataFrame({'date': {0: 20100201, 1: 20100201, 2: 20100201, 3: 20100201, 4: 20100202, 5: 20100202, 6: 20100202, 7: 20100202, 8: 20100203, 9: 20100203, 10: 20100204}, 
                   'time': {0: 0, 1: 6, 2: 12, 3: 18, 4: 0, 5: 6, 6: 12, 7: 18, 8: 0, 9: 18, 10: 6},
                   'value': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8, 8: 9, 9: 11, 10: 12}})

print (df)
        date  time  value
0   20100201     0      1
1   20100201     6      2
2   20100201    12      3
3   20100201    18      4
4   20100202     0      5
5   20100202     6      6
6   20100202    12      7
7   20100202    18      8
8   20100203     0      9
9   20100203    18     11
10  20100204     6     12

print (df.pivot(index='date', columns='time', values='value')
         .unstack()
         .reset_index(name='value')
         .sort_values('date'))

    time      date  value
0      0  20100201    1.0
4      6  20100201    2.0
8     12  20100201    3.0
12    18  20100201    4.0
1      0  20100202    5.0
5      6  20100202    6.0
9     12  20100202    7.0
13    18  20100202    8.0
2      0  20100203    9.0
6      6  20100203    NaN
10    12  20100203    NaN
14    18  20100203   11.0
3      0  20100204    NaN
7      6  20100204   12.0
11    12  20100204    NaN
15    18  20100204    NaN
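
The result above has the columns in the order time, date, value and is sorted on date only. As a minimal sketch (not part of the original answer), the same chain can sort on both keys and select the columns in the order used in the question. The variable name out is only illustrative.

```python
import pandas as pd

# Same sample data as in the answer above
df = pd.DataFrame({
    'date': [20100201] * 4 + [20100202] * 4 + [20100203] * 2 + [20100204],
    'time': [0, 6, 12, 18, 0, 6, 12, 18, 0, 18, 6],
    'value': [1, 2, 3, 4, 5, 6, 7, 8, 9, 11, 12],
})

out = (df.pivot(index='date', columns='time', values='value')
         .unstack()                       # Series indexed by (time, date)
         .reset_index(name='value')       # back to long format
         .sort_values(['date', 'time'])   # explicit ordering on both keys
         .reset_index(drop=True)
         [['date', 'time', 'value']])     # column order from the question

print(out)
```

Sorting on both columns makes the row order independent of how ties on date happen to be resolved by the sort.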

You can call reset_index again, this time with drop=True, if you want a clean consecutive index:

print (df.pivot(index='date', columns='time', values='value')
         .unstack()
         .reset_index(name='value')
         .sort_values('date')
         .reset_index(drop=True))

    time      date  value
0      0  20100201    1.0
1      6  20100201    2.0
2     12  20100201    3.0
3     18  20100201    4.0
4      0  20100202    5.0
5      6  20100202    6.0
6     12  20100202    7.0
7     18  20100202    8.0
8      0  20100203    9.0
9      6  20100203    NaN
10    12  20100203    NaN
11    18  20100203   11.0
12     0  20100204    NaN
13     6  20100204   12.0
14    12  20100204    NaN
15    18  20100204    NaN
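
Note that pivot can only create rows for dates that occur at least once in the data, so a day that is missing entirely would not be added. One possible way to cover that case, sketched here under the assumption that each (date, time) pair occurs at most once, is to build the full calendar with date_range and reindex against it. The sample data below is hypothetical (the whole day 20100203 is absent), and the names all_days, full_index and filled are only illustrative.

```python
import pandas as pd

# Hypothetical sample: day 20100203 is missing entirely,
# and day 20100204 only has hour 6.
df = pd.DataFrame({
    'date': [20100201] * 4 + [20100202] * 4 + [20100204],
    'time': [0, 6, 12, 18, 0, 6, 12, 18, 6],
    'value': [1, 2, 3, 4, 5, 6, 7, 8, 12],
})

# Parse the integer dates so that a continuous daily range can be built
days = pd.to_datetime(df['date'].astype(str), format='%Y%m%d')
all_days = pd.date_range(days.min(), days.max(), freq='D')

# Every combination of day (converted back to the integer form) and hour
full_index = pd.MultiIndex.from_product(
    [all_days.strftime('%Y%m%d').astype('int64'), [0, 6, 12, 18]],
    names=['date', 'time'])

# Reindexing inserts the missing (date, time) pairs with NaN values
filled = (df.set_index(['date', 'time'])
            .reindex(full_index)
            .reset_index())

print(filled)
```

Because from_product builds the index in calendar order, the rows come out sorted by date and then time without an extra sort_values call.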
