Joining complicated pandas tables
Question
I'm trying to join a dataframe of results from a statsmodels GLM to a dataframe designed to hold both univariate data and model results as models are iterated through. I'm having trouble working out the syntax for joining the two data sets.
I've consulted the pandas documentation, but with no luck.
This is difficult because the model output has a different shape from the final table, which holds a value for each unique level of each unique variable.
The code below builds an example of what the data looks like:
import pandas as pd
df = {'variable': ['CLded_model','CLded_model','CLded_model','CLded_model','CLded_model','CLded_model','CLded_model'
,'channel_model','channel_model','channel_model']
, 'level': [0,100,200,250,500,750,1000, 'DIR', 'EA', 'IA']
,'value': [460955.7793,955735.0532,586308.4028,12216916.67,48401773.87,1477842.472,14587994.92,10493740.36
,36388470.44,31805316.37]}
final_table = pd.DataFrame(df)
df2 = {'variable': ['intercept','C(channel_model)[T.EA]','C(channel_model)[T.IA]', 'CLded_model']
, 'coefficient': [-2.36E-14,-0.091195797,-0.244225888, 0.00174356]}
model_results = pd.DataFrame(df2)
After running this, you can see that for categorical variables the variable name in model_results is encased in a few extra layers, for example C(channel_model)[T.EA], compared to final_table. A numerical variable such as CLded_model needs to be joined, at every level, with the single coefficient associated with it.
There is a lot to this and I'm not sure where to start.
Update: the following code builds the desired result:
d3 = {'variable': ['intercept', 'CLded_model','CLded_model','CLded_model','CLded_model','CLded_model','CLded_model'
,'CLded_model','channel_model','channel_model','channel_model']
, 'level': [None, 0,100,200,250,500,750,1000, 'DIR', 'EA', 'IA']
,'value': [None, 60955.7793,955735.0532,586308.4028,12216916.67,48401773.87,1477842.472,14587994.92,10493740.36
,36388470.44,31805316.37]
, 'coefficient': [ -2.36E-14, 0.00174356, 0.00174356, 0.00174356, 0.00174356, 0.00174356 ,0.00174356
, 0.00174356,None, -0.091195797,-0.244225888, ]}
desired_result = pd.DataFrame(d3)
Recommended answer
First you have to clean df2 (here df1 and df2 are taken to be the DataFrames final_table and model_results from the question):
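# regex=False treats the patterns as literal text, so no escaping is needed
df2['variable'] = (df2['variable']
                   .str.replace("C(", "", regex=False)
                   .str.replace(")[T.", "-", regex=False)
                   .str.rstrip("]"))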
df2
variable coefficient
0 intercept -2.360000e-14
1 channel_model-EA -9.119580e-02
2 channel_model-IA -2.442259e-01
3 CLded_model 1.743560e-03
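As an alternative to chained replacements, the same cleaning can be sketched with a single regular expression. This is only an illustration: the group names name and lvl and the variable names below are made up for the example, and it assumes every categorical term has the form C(name)[T.level] as in the question's data.

```python
import pandas as pd

model_results = pd.DataFrame({
    "variable": ["intercept", "C(channel_model)[T.EA]",
                 "C(channel_model)[T.IA]", "CLded_model"],
    "coefficient": [-2.36e-14, -0.091195797, -0.244225888, 0.00174356],
})

names = model_results["variable"]

# Capture the variable name and the level; rows that do not match get NaN in both groups.
parts = names.str.extract(r"^C\((?P<name>[^)]+)\)\[T\.(?P<lvl>[^\]]+)\]$")

# Join "name-level" where the pattern matched, otherwise keep the original name.
cleaned = parts["name"].str.cat(parts["lvl"], sep="-").fillna(names)

print(cleaned.tolist())
```

Plain terms such as intercept and CLded_model do not match the pattern, so they pass through unchanged.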
Because some rows of df1 need to be matched on the level column as well and others do not, we change df1 slightly so that its variable column matches df2:
df1.loc[df1['variable'] == 'channel_model', 'variable'] = "channel_model-"+df1.loc[df1['variable'] == 'channel_model', 'level']
df1
#snippet of what changed
variable level value
6 CLded_model 1000 1.458799e+07
7 channel_model-DIR DIR 1.049374e+07
8 channel_model-EA EA 3.638847e+07
9 channel_model-IA IA 3.180532e+07
Then we merge them:
df4 = df1.merge(df2, how = 'outer', left_on =['variable'], right_on = ['variable'])
This gives your desired result, except for the minor change to the variable names of the categorical rows.
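Putting the steps together, here is a minimal end-to-end sketch using the question's data. It differs from the answer in one respect, which is an assumption of this example rather than part of the original: the merge key is built in a temporary column (called key here, a made-up name) so that the original variable names survive the merge.

```python
import pandas as pd

final_table = pd.DataFrame({
    "variable": ["CLded_model"] * 7 + ["channel_model"] * 3,
    "level": [0, 100, 200, 250, 500, 750, 1000, "DIR", "EA", "IA"],
    "value": [460955.7793, 955735.0532, 586308.4028, 12216916.67, 48401773.87,
              1477842.472, 14587994.92, 10493740.36, 36388470.44, 31805316.37],
})
model_results = pd.DataFrame({
    "variable": ["intercept", "C(channel_model)[T.EA]",
                 "C(channel_model)[T.IA]", "CLded_model"],
    "coefficient": [-2.36e-14, -0.091195797, -0.244225888, 0.00174356],
})

# Coefficients: turn "C(channel_model)[T.EA]" into the key "channel_model-EA".
coefs = model_results.copy()
coefs["key"] = (coefs["variable"]
                .str.replace("C(", "", regex=False)
                .str.replace(")[T.", "-", regex=False)
                .str.rstrip("]"))
coefs = coefs.drop(columns="variable")

# Table: categorical rows get "variable-level" as key, numeric rows just the variable.
table = final_table.copy()
is_cat = table["variable"] == "channel_model"
table["key"] = table["variable"]
table.loc[is_cat, "key"] = "channel_model-" + table.loc[is_cat, "level"].astype(str)

# The outer merge keeps rows found on only one side (intercept, channel_model DIR).
merged = table.merge(coefs, how="outer", on="key")
merged["variable"] = merged["variable"].fillna(merged["key"])
merged = merged.drop(columns="key")

print(merged)
```

The intercept row has no level or value, and the DIR row has no coefficient because it is the reference level that the model absorbs into the intercept. The row order of the merged frame may differ from the desired_result table in the question.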