多索引数据框 pandas 中的操作 [英] Operations in multi index dataframe pandas

查看:53
本文介绍了多索引数据框 pandas 中的操作的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我需要处理来自大数据csv的地理和统计数据.它包含来自地理行政和地统计的数据.自治区,区位,地统计基本分区和地块构成了层次索引.

I need to process geographic and statistical data from a big data csv. It contains data from geographical administrative and geostatistical. Municipality, Location, geostatistical basic division and block constitute the hierarchical indexes.

我必须为每个元素创建一个新列['data2'],即地理索引中数据的最大值,然后将每个块值除以该值.对于每个索引级别,索引级别值必须不同于0,因为0索引级别值说明了计算中未使用的其他类型的信息.

I have to create a new column ['data2'] for every element the max value of the data in the geo index, and divide each block value by that value. For each index level, and the index level value must be different from 0, because the 0 index level value accounts for other types of info not used in the calculation.

                       data1  data2
mun  loc  geo  block
1    0    0    0       20     20
1    1    0    0       10     10
1    1    1    0       10     10   
1    1    1    1       3      3/4
1    1    1    2       4      4/4
1    1    2    0       30     30   
1    1    2    1       1      1/3
1    1    2    2       3      3/3
1    2    1    1       10     10/12
1    2    1    2       12     12/12
2    1    1    1       123    123/123
2    1    1    2       7      7/123
2    1    2    1       6      6/6
2    1    2    2       1      1/6

有什么想法吗?我已经尝试过for循环,使用reset_index()转换列中的索引,并按列和行值进行迭代,但是计算将永远耗费时间,我认为这不是进行此类操作的正确方法.

Any ideas? I have tried with for loops, converting the indexes in columns with reset_index() and iterating by column and row values but the computation is taking forever and I think that is not the correct way to do this kind of operations.

此外,如果我想得到这样的蒙版怎么办,这样我就可以将计算运用于各个层次.

Also, what if I want to get my masks like this, so I can run my calculations to every level.

mun  loc  geo  block
1    0    0    0     False       
1    1    0    0     False       
1    1    1    0     True          
1    1    1    1     False        
1    1    1    2     False        
1    1    2    0     True          
1    1    2    1     False        
1    1    2    2     False        

mun  loc  geo  block
1    0    0    0     False       
1    1    0    0     True       
1    1    1    0     False          
1    1    1    1     False        
1    1    1    2     False
1    2    0    0     True
1    2    2    0     False          
1    2    2    1     False        

mun  loc  geo  block
1    0    0    0     True       
1    1    0    0     False       
1    1    1    0     False          
1    1    1    1     False        
1    1    1    2     False
2    0    0    0     True
2    1    1    0     False          
2    1    2    1     False   

推荐答案

您可以先从mask. MultiIndex.html"rel =" nofollow> MultiIndex ,与0进行比较,并通过

You can first create mask from MultiIndex, compare with 0 and check at least one True (at least one 0) by any:

mask = (pd.DataFrame(df.index.values.tolist(), index=df.index) == 0).any(axis=1)
print (mask)
mun  loc  geo  block
1    0    0    0         True
     1    0    0         True
          1    0         True
               1        False
               2        False
          2    0         True
               1        False
               2        False
     2    1    1        False
               2        False
2    1    1    1        False
               2        False
          2    1        False
               2        False
dtype: bool

然后获取 max 值,每个 groupby 第一个,第二个和第三个索引,但在通过 boolean indexing 仅在mask中不是True的值:

Then get max values by groupby per first, second and third index, but before filter by boolean indexing only values where are not True in mask:

df1 = df.ix[~mask, 'data1'].groupby(level=['mun','loc','geo']).max()
print (df1)
mun  loc  geo
1    1    1        4
          2        3
     2    1       12
2    1    1      123
          2        6

然后 reindex 通过df.index,通过Multiindex的最后一级> reset_index mask 值,其中mask不变(也必须删除最后一个级别)和

Then reindex df1 by df.index, remove last level of Multiindex by reset_index, mask values where no change by mask (also is necessary remove last level) and fillna by 1, because dividing return same value.

df1 = df1.reindex(df.reset_index(level=3, drop=True).index)
         .mask(mask.reset_index(level=3, drop=True)).fillna(1)
print (df1)
Name: data1, dtype: int64
mun  loc  geo
1    0    0        1.0
     1    0        1.0
          1        1.0
          1        4.0
          1        4.0
          2        1.0
          2        3.0
          2        3.0
     2    1       12.0
          1       12.0
2    1    1      123.0
          1      123.0
          2        6.0
          2        6.0
Name: data1, dtype: float64

最后除以 div :

Last divide by div:

print (df['data1'].div(df1.values,axis=0))
mun  loc  geo  block
1    0    0    0        20.000000
     1    0    0        10.000000
          1    0        10.000000
               1         0.750000
               2         1.000000
          2    0        30.000000
               1         0.333333
               2         1.000000
     2    1    1         0.833333
               2         1.000000
2    1    1    1         1.000000
               2         0.056911
          2    1         1.000000
               2         0.166667
dtype: float64

这篇关于多索引数据框 pandas 中的操作的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆