合并两个具有重叠范围的数据框并计算按类别分组的重叠 [英] Combine two data frames with overlapping ranges and calculate overlap grouped by category

查看:60
本文介绍了合并两个具有重叠范围的数据框并计算按类别分组的重叠的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

基于这个答案
我有两个大数据框(100K行),df Assay具有值,df Strat具有类型".我想基于深度将Strat中的类型"分配给Assay中的列.深度以深度从"和深度到"列给出. 类型"也由深度从"和到"定义.但是它们不是相同的间隔.测定深度可能跨越多种Strat类型.

Building on this answer
I have two big dataframes (100K rows), df Assay has values, df Strat has 'Types'. I want to assign a 'Type' from Strat to a column in Assay based on depth. The depths are given as depth 'From' and depth 'To' columns. The 'types' are also defined by depth 'From' and 'To'. BUT they are NOT the same intervals. Assay depths may span multiple Strat types.

我想将Strat的类型"分配给Assay df,如果有多个类型,也尝试捕获该信息.

I want to assign the Strat 'types' to the Assay df, and if there are multiple types, try and capture that information too.

我想遍历数据以为每个HOLE_ID填充类型"列.

I want to loop over the data to populate the Type column for each HOLE_ID.

生成示例数据:

import pandas as pd
import numpy as np
Assay=pd.DataFrame(np.array([['Hole_1',1.0,2.5,0.001],['Hole_1',2.5,5.0,0.005],['Hole_1',5.0,7.0,0.002],['Hole_1',7.0,10.0,0.001],['Hole_2',1.0,3.0,0.001],['Hole_2',3.0,5.0,0.005],['Hole_2',5.0,7.0,0.002],['Hole_2',7.0,10.0,0.001]]),columns=['HOLE_ID','FROM', 'TO', 'val'])

Strat=pd.DataFrame(np.array([['Hole_1',0.0,4.0,'A'],['Hole_1',4.0,5.0,'B'],['Hole_1',5.0,6.5,'C'],['Hole_1',6.5,8.0,'D'],['Hole_1',8.0,10.0,'E'],['Hole_2',0.0,4.0,'A'],['Hole_2',4.0,5.1,'B'],['Hole_2',5.1,6.0,'C'],['Hole_2',6.0,8.0,'D'],['Hole_2',8.0,10.0,'E']]),columns=['HOLE_ID','FROM', 'TO', 'Type'])

Assay

Out[1]: 
  HOLE_ID FROM    TO    val
0  Hole_1  1.0   2.5  0.001
1  Hole_1  2.5   5.0  0.005
2  Hole_1  5.0   7.0  0.002
3  Hole_1  7.0  10.0  0.001
4  Hole_2  1.0   3.0  0.001
5  Hole_2  3.0   5.0  0.005
6  Hole_2  5.0   7.0  0.002
7  Hole_2  7.0  10.0  0.001

Strat

Out[2]: 
  HOLE_ID FROM    TO Type
0  Hole_1  0.0   4.0    A
1  Hole_1  4.0   5.0    B
2  Hole_1  5.0   6.5    C
3  Hole_1  6.5   8.0    D
4  Hole_1  8.0  10.0    E
5  Hole_2  0.0   4.0    A
6  Hole_2  4.0   5.1    B
7  Hole_2  5.1   6.0    C
8  Hole_2  6.0   8.0    D
9  Hole_2  8.0  10.0    E

所需输出示例:

  HOLE_ID FROM    TO    val                Type
0  Hole_1  1.0   2.5  0.001              A 100%
1  Hole_1  2.5     5  0.005         A 60%,B 44%
2  Hole_1  5.0   7.0  0.002        C 80%, D 20%
3  Hole_1  7.0  10.0  0.001        D 30%, E 70%
4  Hole_2  1.0   3.0  0.001              A 100%
5  Hole_2  3.0   5.0  0.005         A 50%, B50%
6  Hole_2  5.0   7.0  0.002  B 5%, C 45%, D 50%
7  Hole_2  7.0  10.0  0.001         D 30% E 70%

我的尝试在下面,但是没有用.我不太擅长循环,我尝试了一些有希望的尝试,但是代码似乎可以永远运行(请注意,我的实际数据集是〜100k行和1500个HOLE_ID,因此可能对我的系统要求很高).

My attempt is below, but doesn't work. I am not very good at loops and I have had a few promising attempts but the code seemed to run forever (note, my actual dataset is ~100k rows and 1500 HOLE_ID's so may be quite demanding on my system).

我添加了np.arange,以便可以使用浮点数(以0.1 m为增量来生成辅助序列),我想我已经计算出百分比,但是我的深度有点儿不足.

I have added np.arange so that I can use floats (0.1 m increments to generate the auxiliary series) and I think I have the percentages calculating but I am a bit out of my depth.

进行一些预处理,以确保仅使用匹配的孔ID(实际数据很大,并且还包含示例数据集中未包含的其他列).

A bit of pre-processing to make sure only matching hole IDs are used (real data is big, and also contains additional cols not included in the example data set.)

assay_Hole_IDs =Assay['HOLE_ID'].unique().tolist()
strat_Hole_IDS =Strat['HOLE_ID'].unique().tolist()

Strat=Strat[Strat['HOLE_ID'].isin(assay_Hole_IDs)]
Assay=Assay[Assay['HOLE_ID'].isin(assay_Hole_IDs)]


assay_Hole_IDs =Assay['HOLE_ID'].unique().tolist()
strat_Hole_IDS =Strat['HOLE_ID'].unique().tolist()

检查是否没有其他值


j=set(assay_Hole_IDs).symmetric_difference(set(strat_Hole_IDS))
print len(j)
j

然后:

all_holes= Strat['HOLE_ID'].unique().tolist()

def getType(row):
for hole in all_holes:
    df=Strat.loc[Strat['HOLE_ID']==hole]

    units = df.set_index('Type').apply(lambda row: pd.Series(
    np.arange(row.FROM, row.TO,0.1)), axis=1).stack()\
    .reset_index(level=1, drop=True)

    gr = units[units.ge(row.FROM) & units.lt(row.TO)].groupby(level=0)
    if gr.ngroups == 1:
        return gr.ngroup().index[0]

    txt = []
    counts = []
    pct=[]
    for key, grp in gr:
        siz = grp.size
        un = 'unit' if siz == 1 else 'units'

        counts.append(float(siz))
    for x in counts:
        p=(float(x)/float(sum(counts))*100)
        pct.append(float(p))
    return pct

然后:

assay['Type'] = assay.groupby('HOLE_ID').apply(getType)

谁能看到为什么它不起作用?

Can anyone see why this isn't working?

推荐答案

def group(df1):
    df2 = Strat[Strat['HOLE_ID']==df1.iloc[0]['HOLE_ID']]
    df1[['FROM','TO']] = df1[['FROM','TO']].astype(float)
    df2[['FROM','TO']] = df2[['FROM','TO']].astype(float)

    temp =  pd.concat([df1[['FROM','TO']],df2[['FROM','TO']]]).unstack().reset_index(drop=True) \
              .drop_duplicates().sort_values().reset_index(drop=True) \
              .to_frame(name='FROM').merge(df2, how='outer').ffill()
   temp['TO'] = temp.shift(-1)['FROM']


    def tobins(x):
        agg = temp[(x.FROM <= temp.FROM) & (temp.FROM < x.TO)].groupby('Type') \
                .apply(lambda y: y['TO'].max() - y['FROM'].min()).reset_index(name='res')
        agg.res = agg.Type + ' ' + (agg.res/agg.res.sum()).map('{:.0%}'.format)
        return '; '.join(agg.res.tolist())

    df1['Type'] = df1.apply(tobins,axis=1)
    return df1

Assay.groupby('HOLE_ID').apply(group)

  HOLE_ID  FROM    TO    val          Type
0  Hole_1   1.0   2.5  0.001        A 100%
1  Hole_1   2.5   5.0  0.005  A 60%; B 40%
2  Hole_1   5.0   7.0  0.002  C 75%; D 25%
3  Hole_1   7.0  10.0  0.001  D 33%; E 67%
4  Hole_2   1.0   3.0  0.001        A 100%
5  Hole_2   3.0   5.0  0.005        B 100%
6  Hole_2   5.0   7.0  0.002  C 75%; D 25%
7  Hole_2   7.0  10.0  0.001  D 33%; E 67%


关键点是用两个表中的所有点FROM和TO来构建临时DataFrame.对于HOLE_ID = 'Hole_1',它看起来如下.现在,我们可以使用(x.FROM <= temp.FROM < x.TO)为化验(x)的每一行获取临时表的这些行,按类型对它们进行分组,计算份额并加入结果格式


The key point is building temp DataFrame with all point FROM and TO from both table. For HOLE_ID = 'Hole_1' it looks as below. Now we can get for each row of Assay (x) those rows of temp table with (x.FROM <= temp.FROM < x.TO), group them by Type, calculate shares and join to result format

   FROM HOLE_ID    TO Type
0   0.0  Hole_1   1.0    A
1   1.0  Hole_1   2.5    A
2   2.5  Hole_1   4.0    A
3   4.0  Hole_1   5.0    B
4   5.0  Hole_1   6.5    C
5   6.5  Hole_1   7.0    D
6   7.0  Hole_1   8.0    D
7   8.0  Hole_1  10.0    E
8  10.0  Hole_1   NaN    E

这篇关于合并两个具有重叠范围的数据框并计算按类别分组的重叠的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆