Doing many iterations of curve_fit in one go for piecewise function


Question

I'm trying to perform many iterations of Scipy's curve_fit at once in order to avoid loops and therefore increase speed.

This is very similar to this problem, which was solved. However, the fact that the functions are piecewise (discontinuous) means that solution isn't applicable here.

Consider the following example:

import numpy as np
from numpy import random as rng
from scipy.optimize import curve_fit
rng.seed(0)
N=20
X=np.logspace(-1,1,N)
Y = np.zeros((4, N))
for i in range(0,4):
    b = i+1
    a = b
    print(a,b)
    Y[i] = (X/b)**(-a) #+ 0.01 * rng.randn(N)
    Y[i, X>b] = 1
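As a quick numeric check (not part of the original question), each row sits above 1 on the power-law branch and drops to exactly 1 past its break point:

```python
import numpy as np

# Rebuild the example data from above
N = 20
X = np.logspace(-1, 1, N)
Y = np.zeros((4, N))
for i in range(4):
    b = a = i + 1
    Y[i] = (X / b) ** (-a)
    Y[i, X > b] = 1

for i, row in enumerate(Y):
    b = i + 1
    # Last point on the power-law branch vs. first clamped point
    print('b=%d: %.3f -> %.3f at the break' % (b, row[X < b][-1], row[X > b][0]))
```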

These curves are discontinuous at X==b. I can retrieve the original values of a and b by using curve_fit iteratively:

def plaw(r, a, b):
    """ Theoretical power law for the shape of the normalized conditional density """
    return np.piecewise(r, [r < b, r >= b], [lambda x: (x/b)**-a, lambda x: 1])


coeffs=[]
for ix in range(Y.shape[0]):
    print(ix)
    c0, pcov = curve_fit(plaw, X, Y[ix])
    coeffs.append(c0)

But this process can be very slow depending on the size of X, Y and the loop, so I'm trying to speed things up by getting coeffs without the need for a loop. So far I haven't had any luck.

Things that might be important

  • X and Y only contain positive values
  • a and b are always positive
  • Although the data to fit in this example is smooth (for the sake of simplicity), the real data has noise

EDIT

This is as far as I got:

y=np.ma.masked_where(Y<1.01, Y)

lX = np.log(X)
lY = np.log(y)
A = np.vstack([lX, np.ones(len(lX))]).T
m,c=np.linalg.lstsq(A, lY.T)[0]

print('a=',-m)
print('b=',np.exp(-c/m))

But even without any noise the output is:

a= [0.18978965578339158 1.1353633705997466 2.220234483915197 3.3324502660995714]
b= [339.4090881838179 7.95073481873057 6.296592007396107 6.402567167503574]

Which is way worse than I was hoping to get.
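A likely culprit: np.linalg.lstsq does not honor the mask of a masked array (it operates on the raw data), so the log(1) = 0 points past each break are still fed into the fit. Selecting the pre-break points per row with a plain boolean mask recovers the exact values on this noise-free data (a sketch using np.polyfit for the per-row line fit, not the OP's code):

```python
import numpy as np

# Same noise-free example data as above
N = 20
X = np.logspace(-1, 1, N)
Y = np.zeros((4, N))
for i in range(4):
    b = a = i + 1
    Y[i] = (X / b) ** (-a)
    Y[i, X > b] = 1

lX = np.log(X)
fits = []
for row in Y:
    keep = row > 1                      # only the power-law branch left of the break
    m, c = np.polyfit(lX[keep], np.log(row[keep]), 1)
    fits.append((-m, np.exp(-c / m)))   # a = -slope, b = exp(-intercept / slope)

for a_fit, b_fit in fits:
    print('a=%.4f  b=%.4f' % (a_fit, b_fit))
```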

Answer

Here are three approaches to speeding this up. You gave no desired speed up or accuracies, or even vector sizes, so buyer beware.

Timings:

len        1      2      3      4
1000     0.045  0.033  0.025  0.022
10000    0.290  0.097  0.029  0.023
100000   3.429  0.767  0.083  0.030
1000000    -      -    0.546  0.046

1) Original Method
2) Pre-estimate with Subset
3) M Newville [linear log-log estimate](https://stackoverflow.com/a/44975066/7311767)
4) Subset Estimate (Use Less Data)

Pre-estimate with Subset (Method 2):

A decent speedup can be achieved by simply running curve_fit twice, where the first pass uses a short subset of the data to get a quick estimate. That estimate is then used to seed a curve_fit over the entire dataset.

x, y = current_data
stride = int(max(1, len(x) / 200))
c0 = curve_fit(power_law, x[0:len(x):stride], y[0:len(y):stride])[0]
return curve_fit(power_law, x, y, p0=c0)[0]

M Newville linear log-log estimate (Method 3):

Using the log estimate proposed by M Newville is also considerably faster. Since the OP was concerned about the initial-estimate method proposed by Newville, this method uses curve_fit on a subset to provide the estimate of the break point in the curve.

x, y = current_data
stride = int(max(1, len(x) / 200))
c0 = curve_fit(power_law, x[0:len(x):stride], y[0:len(y):stride])[0]

index_max = np.where(x > c0[1])[0][0]
log_x = np.log(x[:index_max])
log_y = np.log(y[:index_max])
result = linregress(log_x, log_y)
m, c = -result[0], np.exp(-result[1] / result[0])
return (m, c), result

Use Less Data (Method 4):

Finally, the seeding mechanism used by the previous two methods already provides pretty good estimates on the sample data on its own. Of course, it is sample data, so your mileage may vary.

stride = int(max(1, len(x) / 200))
c0 = curve_fit(power_law, x[0:len(x):stride], y[0:len(y):stride])[0]

Test code:

import numpy as np
from numpy import random as rng
from scipy.optimize import curve_fit
from scipy.stats import linregress

fit_data = {}
current_data = None

def data_for_fit(a, b, n):
    key = a, b, n
    if key not in fit_data:
        rng.seed(0)
        x = np.logspace(-1, 1, n)
        y = np.clip((x / b) ** (-a) + 0.01 * rng.randn(n), 0.001, None)
        y[x > b] = 1
        fit_data[key] = x, y
    return fit_data[key]


def power_law(r, a, b):
    """ Power law for the shape of the normalized conditional density """
    return np.piecewise(
        r, [r < b, r >= b], [lambda x: (x/b)**-a, lambda x: 1])

def method1():
    x, y = current_data
    return curve_fit(power_law, x, y)[0]

def method2():
    x, y = current_data
    return curve_fit(power_law, x, y, p0=method4()[0])

def method3():
    x, y = current_data
    c0, pcov = method4()

    index_max = np.where(x > c0[1])[0][0]
    log_x = np.log(x[:index_max])
    log_y = np.log(y[:index_max])
    result = linregress(log_x, log_y)
    m, c = -result[0], np.exp(-result[1] / result[0])
    return (m, c), result

def method4():
    x, y = current_data
    stride = int(max(1, len(x) / 200))
    return curve_fit(power_law, x[0:len(x):stride], y[0:len(y):stride])

from timeit import timeit

def runit(stmt):
    print("%s: %.3f  %s" % (
        stmt, timeit(stmt + '()', number=10,
                     setup='from __main__ import ' + stmt),
        eval(stmt + '()')[0]
    ))

def runit_size(size):

    print('Length: %d' % size)
    if size <= 100000:
        runit('method1')
        runit('method2')
    runit('method3')
    runit('method4')


for i in (1000, 10000, 100000, 1000000):
    current_data = data_for_fit(3, 3, i)
    runit_size(i)
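As a standalone sanity check of the two-pass idea in method 2 (a sketch; the values a = b = 3 mirror the data_for_fit call above and are not special):

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(r, a, b):
    return np.piecewise(r, [r < b, r >= b], [lambda x: (x / b) ** -a, lambda x: 1.0])

rng = np.random.RandomState(0)
x = np.logspace(-1, 1, 10000)
y = np.clip((x / 3.0) ** -3.0 + 0.01 * rng.randn(x.size), 0.001, None)
y[x > 3.0] = 1

# First pass: ~200 strided points give a cheap estimate ...
stride = int(max(1, len(x) / 200))
c0 = curve_fit(power_law, x[::stride], y[::stride])[0]
# ... which then seeds the fit over the full dataset.
a_fit, b_fit = curve_fit(power_law, x, y, p0=c0)[0]
print('a=%.3f  b=%.3f' % (a_fit, b_fit))
```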
