pandas 中各列的曲线拟合+外推值 [英] Curve fitting for each column in Pandas + extrapolate values

查看:166
本文介绍了 pandas 中各列的曲线拟合+外推值的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个包含300列的数据集,每列都与深度有关. Pandas DataFrame的简化版本如下所示:

I have a data set with some 300 columns, each of them depth-dependent. The simplified version of the Pandas DataFrame would look something like this:

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy_optimize import curve_fit

df1 = pd.DataFrame({'depth': [1.65, 2.15, 2.65, 3.15, 3.65, 4.15, 4.65, 5.15, 5.65, 6.15, 6.65, 7.15, 7.65, 8.15, 8.65],
               '400.0': [13.909261, 7.758734, 3.513627, 2.095409, 1.628918, 0.782643, 0.278548, 0.160153, -0.155895, -0.152373, -0.147820, -0.023997, 0.010729, 0.006050, 0.002356],
               '401.0': [14.581624, 8.173803, 3.757856, 2.223524, 1.695623, 0.818065, 0.300235, 0.173674, -0.145402, -0.144456, -0.142969, -0.022471, 0.010802, 0.006181, 0.002641],
               '402.0': [15.253988, 8.588872, 4.002085, 2.351638, 1.762327, 0.853486, 0.321922, 0.187195, -0.134910, -0.136539, -0.138118, -0.020945, 0.010875, 0.006313, 0.002927],
               '403.0': [15.633908, 8.833914, 4.146499, 2.431543, 1.798185, 0.874350, 0.333470, 0.192128, -0.130119, -0.134795, -0.136049, -0.019307, 0.012037, 0.006674, 0.003002],
               '404.0': [15.991816, 9.066159, 4.283401, 2.507818, 1.831721, 0.894119, 0.344256, 0.196415, -0.125758, -0.133516  , -0.134189, -0.017659, -0.013281,0.007053, 0.003061],
               '405.0': [16.349725, 9.298403, 4.420303, 2.584094, 1.865257, 0.913887, 0.355041, 0.200702, -0.121396, -0.132237, -0.132330, -0.016012, 0.014525, 0.007433, 0.003120]
               })

我需要做的是估算以下等式中的 K .基本上,每一列都对应一个 I(z)配置文件.我要使用curve_fit作为计算的 I(0),作为参考,我正在使用以下有用的文章:

What I need to do is to estimate the K in the equation below. Basically each column corresponds to a I(z) profile. The I(0) has to be calculated, for which I used the curve_fit, as a reference I'm using this helpful post: https://stackoverflow.com/a/15369787/7541421

x = df1.depth       # Column values as a function of depth
y = df1['400.0']

plt.plot(x, y, 'ro',label="Original Data")

def func(def func(x, I0, k):     # a = I0, b = k
    return I0 * np.exp(-k*x)    

popt, pcov = curve_fit(func, x, y)

print ("E0 = %s , k = %s" % (popt[0], popt[1]))

plt.plot(x, func(x, *popt), label="Fitted Curve")

是否可以分别为每个列完成,并以某种方式保存作为新的DataFrame ?

Could this be done for each column separately and somehow saved as a new DataFrame?

此外,对于某些dz配额,需要将新的DataFrame传播到z=0的值.在这种情况下,我的depth列中缺少[0.15、0.65、1.15]. 因此,对于每个z,我需要从函数的每一列中获取I(z).

Also, the new DataFrame needs to be propagated to the values towards z=0 for certain dz quotas. In this case I'm missing [0.15, 0.65, 1.15] in my depth column. So for every z I need to get per each column the I(z) from the function.

由于每个数据集在我的情况下都有不同的深度范围,我该如何使其自动化?

How can I automatize it since every data set has a different depth range in my case?

P.S.另外,正如本文最初讨论的那样,可以应用对数转换的线性回归拟合,其解决方案写在下面的答案中.

P.S. Alternatively, as it has been originally discussed in this post, a log-transfored linear regression fit can be applied, for which the solution is written in an answer below.

推荐答案

与该答案的主要作者并获得他的同意后,进行了一些更改.

Some changes have been made after the conversation with the principal author of this answer and with his approval.

首先,由于我们要处理对数转换量,因此有必要找到与每列非负值相对应的值范围.

First of all, since we are dealing with log-transform quantities, it is necessary to find the range of values which correspond to non-negative values per column.

negative_idx_aux = df_drop_depth.apply(lambda x:(x<0).nonzero()[0][:1].tolist())   
negative_idx = [item for sublist in negative_idx_aux for item in sublist]

if len(negative_idx) > 0:
    max_idx = max_idx = np.min(negative_idx)
else:
    max_idx = None

与原始文件相比,我仅合并了循环以获取斜率和截距.

Compared to the original, I only merge the loops to obtain both the slope and intercept.

iz_cols = df1.columns.difference(['depth'])
slp_int = {}
for c in iz_cols:
    slope, intercept, r_value, p_value, std_err = stats.linregress(df1['depth'][0:max_idx],np.log(df1[c][0:max_idx]))
    slp_int[c] = [intercept, slope]

slp_int = pd.DataFrame(, index = ['intercept', 'slope'])

指数截距为我们提供了表面的I值:

Exponentiating intercept gives us the value of I at the surface:

slp_int.loc['intercept'] = np.exp(slp_int.loc['intercept'])

该帖子的最后一部分由于对最终概念的误解而已得到更正. 现在,将重新创建数据框,并使用新的表面深度值(在df1的深度范围内,将df1的值保持在下面.

The last part of the post has been corrected due to a misunderstanding of the final concept. The dataframe is now recreated, with new values for the surface depths (above the depth range of df1, keeping the df1 for values below.

首先在z = 0和深度列的最大值之间重新创建一个整个范围,并为其分配一个step并将其值保持在z = 0:

First a whole range between z = 0 and the maximum value of the depth column is recreated, with an assigned step plus keeping the value at z = 0:

depth = np.asarray(df1.depth)
depth_min = np.min(depth)    ;   
depth_min_arr = np.array([depth_min])
step = 0.5
missing_vals_aux = np.arange(depth_min - step, 0, -step)[::-1]
missing_vals = np.concatenate(([0.], missing_vals_aux), axis=0)
depth_tot = np.concatenate((missing_vals, depth), axis=0)

df_boundary = pd.DataFrame(columns = iz_cols) 
df_up = pd.DataFrame(columns = iz_cols) 

使用向上传播的深度配额范围创建一个数据框:

Create a dataframe with the range of the upward-propagated depth quotas:

for c in iz_cols: 
    df_up[c]       = missing_vals

使用回归获得的参数填充数据:

Fill the data with the regression-obtained parameters:

upper_df = slp_int.loc['intercept']*np.exp(slp_int.loc['slope']*df_up)
upper_df['depth'] = missing_vals

合并df1和upper_df以获得完整的配置文件:

Merge the df1 and the upper_df to obtain a whole profile:

lower_df = df1
lower_df['depth'] = depth

df_profile_tot = upper_df.append(lower_df, ignore_index=True)

这篇关于 pandas 中各列的曲线拟合+外推值的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆