Iterating pandas dataframe, checking values and creating some of them
Problem description
Ok, I have a (big) dataframe, something like this:
date time value
0 20100201 0 1
1 20100201 6 2
2 20100201 12 3
3 20100201 18 4
4 20100202 0 5
5 20100202 6 6
6 20100202 12 7
7 20100202 18 8
8 20100203 0 9
9 20100203 18 11
10 20100204 6 12
...
8845 20160101 18 8846
As you can see, the dataframe has a column date, a column time with four hours for each day (00, 06, 12, 18), and a column value.
The problem is that there are missing dates in the dataframe. In the example above there should be two extra rows between rows 8 and 9, corresponding to the hours 6 and 12 of the day 20100203, and also an extra row between rows 9 and 10, corresponding to the hour 0 of the day 20100204.
What do I need? I would like to iterate over the date column of the dataframe, checking that every day exists and none is missing, and also that every day has all four hours (00, 06, 12, 18). If something is missing, a row should be added in exactly that place, with the missing date and time and NaN as the value. To avoid copying the whole dataframe again, here are the relevant rows as they should appear in the final version:
...
7 20100202 18 8
8 20100203 0 9
9 20100203 6 NaN
10 20100203 12 NaN
11 20100203 18 11
12 20100204 0 NaN
13 20100204 6 12
...
In case you are interested, an easier version of this problem was asked in "Modular arithmetic in python to iterate a pandas dataframe" and kindly answered by users @Alexander and @piRSquared. The version asked here is more difficult, involving (I suppose) the use of datetime and timedelta and iterating over more columns.
Sorry for the long post and thank you very much.
Recommended answer
You can use pivot for reshaping: every date/time combination that has no row becomes NaN in the columns created from time. Then use unstack with reset_index and sort_values:
import pandas as pd

df = pd.DataFrame({'date': {0: 20100201, 1: 20100201, 2: 20100201, 3: 20100201, 4: 20100202, 5: 20100202, 6: 20100202, 7: 20100202, 8: 20100203, 9: 20100203, 10: 20100204},
                   'time': {0: 0, 1: 6, 2: 12, 3: 18, 4: 0, 5: 6, 6: 12, 7: 18, 8: 0, 9: 18, 10: 6},
                   'value': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8, 8: 9, 9: 11, 10: 12}})
print(df)
date time value
0 20100201 0 1
1 20100201 6 2
2 20100201 12 3
3 20100201 18 4
4 20100202 0 5
5 20100202 6 6
6 20100202 12 7
7 20100202 18 8
8 20100203 0 9
9 20100203 18 11
10 20100204 6 12
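# Sort on both keys so the hours stay in order within each date
# (the default sort algorithm is not guaranteed to be stable).
print(df.pivot(index='date', columns='time', values='value')
        .unstack()
        .reset_index(name='value')
        .sort_values(['date', 'time']))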
time date value
0 0 20100201 1.0
4 6 20100201 2.0
8 12 20100201 3.0
12 18 20100201 4.0
1 0 20100202 5.0
5 6 20100202 6.0
9 12 20100202 7.0
13 18 20100202 8.0
2 0 20100203 9.0
6 6 20100203 NaN
10 12 20100203 NaN
14 18 20100203 11.0
3 0 20100204 NaN
7 6 20100204 12.0
11 12 20100204 NaN
15 18 20100204 NaN
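To see where the NaN values come from, it can help to look at the intermediate wide table that pivot produces before unstack is applied. The following is only an illustrative sketch on the sample data from the question; the names wide and n_missing are introduced here for illustration and are not part of the original answer.

```python
import pandas as pd

# Sample data from the question, written as plain lists.
df = pd.DataFrame({
    'date': [20100201] * 4 + [20100202] * 4 + [20100203] * 2 + [20100204],
    'time': [0, 6, 12, 18, 0, 6, 12, 18, 0, 18, 6],
    'value': [1, 2, 3, 4, 5, 6, 7, 8, 9, 11, 12],
})

# Wide layout: one row per date, one column per time value that
# occurs at least once anywhere in the data.
wide = df.pivot(index='date', columns='time', values='value')

# Any (date, time) cell without a matching input row is NaN.
n_missing = int(wide.isna().sum().sum())
```

Note that pivot only creates rows for dates and columns for times that appear at least once in the input, and it raises an error if a (date, time) pair occurs more than once.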
You can call reset_index again with drop=True if you need a clean, consecutive index:
# Same pipeline as above, then renumber the rows 0..n-1.
print(df.pivot(index='date', columns='time', values='value')
        .unstack()
        .reset_index(name='value')
        .sort_values(['date', 'time'])
        .reset_index(drop=True))
time date value
0 0 20100201 1.0
1 6 20100201 2.0
2 12 20100201 3.0
3 18 20100201 4.0
4 0 20100202 5.0
5 6 20100202 6.0
6 12 20100202 7.0
7 18 20100202 8.0
8 0 20100203 9.0
9 6 20100203 NaN
10 12 20100203 NaN
11 18 20100203 11.0
12 0 20100204 NaN
13 6 20100204 12.0
14 12 20100204 NaN
15 18 20100204 NaN
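One caveat: the pivot approach fills gaps only for dates that already appear at least once, so a day that is missing entirely would not be added. A possible way to cover that case too is to build the full calendar explicitly and reindex against it. This is a minimal sketch, not part of the original answer; the function name fill_missing_slots and its hours parameter are hypothetical, and it assumes that date holds integers in YYYYMMDD form and that each (date, time) pair occurs at most once.

```python
import pandas as pd


def fill_missing_slots(df, hours=(0, 6, 12, 18)):
    """Return df with one row per (day, hour); absent slots get NaN values.

    Hypothetical helper: assumes integer YYYYMMDD dates and unique
    (date, time) pairs.
    """
    # Parse the integer dates so that a continuous daily range can be built.
    days = pd.to_datetime(df['date'].astype(str), format='%Y%m%d')
    all_days = pd.date_range(days.min(), days.max(), freq='D')

    # Convert back to the integer YYYYMMDD form used in the dataframe.
    all_dates = all_days.strftime('%Y%m%d').astype('int64')

    # Every combination of calendar day and expected hour, date-major.
    full_index = pd.MultiIndex.from_product(
        [all_dates, list(hours)], names=['date', 'time'])

    # Reindexing inserts the absent combinations with NaN in 'value'.
    return df.set_index(['date', 'time']).reindex(full_index).reset_index()


# Example: 20100202 is missing entirely and is filled with four NaN rows.
sample = pd.DataFrame({'date': [20100201, 20100203],
                       'time': [0, 18],
                       'value': [1.0, 2.0]})
filled = fill_missing_slots(sample)
```

Because the full index is generated in date-major order, the result is already sorted by date and time, so no separate sort_values step is needed.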