多索引数据框 pandas 中的操作 [英] Operations in multi index dataframe pandas
问题描述
我需要处理来自大数据csv的地理和统计数据.它包含来自地理行政和地统计的数据.自治区,区位,地统计基本分区和地块构成了层次索引.
I need to process geographic and statistical data from a big data csv. It contains data from geographical administrative and geostatistical. Municipality, Location, geostatistical basic division and block constitute the hierarchical indexes.
我必须为每个元素创建一个新列['data2'],即地理索引中数据的最大值,然后将每个块值除以该值.对于每个索引级别,索引级别值必须不同于0,因为0索引级别值说明了计算中未使用的其他类型的信息.
I have to create a new column ['data2'] for every element the max value of the data in the geo index, and divide each block value by that value. For each index level, and the index level value must be different from 0, because the 0 index level value accounts for other types of info not used in the calculation.
data1 data2
mun loc geo block
1 0 0 0 20 20
1 1 0 0 10 10
1 1 1 0 10 10
1 1 1 1 3 3/4
1 1 1 2 4 4/4
1 1 2 0 30 30
1 1 2 1 1 1/3
1 1 2 2 3 3/3
1 2 1 1 10 10/12
1 2 1 2 12 12/12
2 1 1 1 123 123/123
2 1 1 2 7 7/123
2 1 2 1 6 6/6
2 1 2 2 1 1/6
有什么想法吗?我已经尝试过for循环,使用reset_index()转换列中的索引,并按列和行值进行迭代,但是计算将永远耗费时间,我认为这不是进行此类操作的正确方法.
Any ideas? I have tried with for loops, converting the indexes in columns with reset_index() and iterating by column and row values but the computation is taking forever and I think that is not the correct way to do this kind of operations.
此外,如果我想得到这样的蒙版怎么办,这样我就可以将计算运用于各个层次.
Also, what if I want to get my masks like this, so I can run my calculations to every level.
mun loc geo block
1 0 0 0 False
1 1 0 0 False
1 1 1 0 True
1 1 1 1 False
1 1 1 2 False
1 1 2 0 True
1 1 2 1 False
1 1 2 2 False
mun loc geo block
1 0 0 0 False
1 1 0 0 True
1 1 1 0 False
1 1 1 1 False
1 1 1 2 False
1 2 0 0 True
1 2 2 0 False
1 2 2 1 False
mun loc geo block
1 0 0 0 True
1 1 0 0 False
1 1 1 0 False
1 1 1 1 False
1 1 1 2 False
2 0 0 0 True
2 1 1 0 False
2 1 2 1 False
推荐答案
您可以先从MultiIndex
,与0
进行比较,并通过
You can first create mask
from MultiIndex
, compare with 0
and check at least one True
(at least one 0
) by any
:
mask = (pd.DataFrame(df.index.values.tolist(), index=df.index) == 0).any(axis=1)
print (mask)
mun loc geo block
1 0 0 0 True
1 0 0 True
1 0 True
1 False
2 False
2 0 True
1 False
2 False
2 1 1 False
2 False
2 1 1 1 False
2 False
2 1 False
2 False
dtype: bool
然后获取 max
值,每个 groupby
第一个,第二个和第三个索引,但在通过 boolean indexing
仅在mask
中不是True
的值:
Then get max
values by groupby
per first, second and third index, but before filter by boolean indexing
only values where are not True
in mask
:
df1 = df.ix[~mask, 'data1'].groupby(level=['mun','loc','geo']).max()
print (df1)
mun loc geo
1 1 1 4
2 3
2 1 12
2 1 1 123
2 6
然后 reindex
df.index
,通过Multiindex的最后一级> reset_index
, mask
值,其中mask
不变(也必须删除最后一个级别)和
Then reindex
df1
by df.index
, remove last level of Multiindex
by reset_index
, mask
values where no change by mask
(also is necessary remove last level) and fillna
by 1
, because dividing return same value.
df1 = df1.reindex(df.reset_index(level=3, drop=True).index)
.mask(mask.reset_index(level=3, drop=True)).fillna(1)
print (df1)
Name: data1, dtype: int64
mun loc geo
1 0 0 1.0
1 0 1.0
1 1.0
1 4.0
1 4.0
2 1.0
2 3.0
2 3.0
2 1 12.0
1 12.0
2 1 1 123.0
1 123.0
2 6.0
2 6.0
Name: data1, dtype: float64
最后除以 div
:
Last divide by div
:
print (df['data1'].div(df1.values,axis=0))
mun loc geo block
1 0 0 0 20.000000
1 0 0 10.000000
1 0 10.000000
1 0.750000
2 1.000000
2 0 30.000000
1 0.333333
2 1.000000
2 1 1 0.833333
2 1.000000
2 1 1 1 1.000000
2 0.056911
2 1 1.000000
2 0.166667
dtype: float64
这篇关于多索引数据框 pandas 中的操作的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!