How to apply linear regression to every pixel in a large multi-dimensional array containing NaNs?


Problem description


I have a 1D array of independent variable values (x_array) that match the timesteps in a 3D numpy array of spatial data with multiple time-steps (y_array). My actual data is much larger: 300+ timesteps and up to 3000 * 3000 pixels:

import numpy as np
from scipy.stats import linregress

# Independent variable: four time-steps of 1-dimensional data 
x_array = np.array([0.5, 0.2, 0.4, 0.4])

# Dependent variable: four time-steps of 3x3 spatial data
y_array = np.array([[[-0.2,   -0.2,   -0.3],
                     [-0.3,   -0.2,   -0.3],
                     [-0.3,   -0.4,   -0.4]],

                    [[-0.2,   -0.2,   -0.4],
                     [-0.3,   np.nan, -0.3],
                     [-0.3,   -0.3,   -0.4]],

                    [[np.nan, np.nan, -0.3],
                     [-0.2,   -0.3,   -0.7],
                     [-0.3,   -0.3,   -0.3]],

                    [[-0.1,   -0.3,   np.nan],
                     [-0.2,   -0.3,   np.nan],
                     [-0.1,   np.nan, np.nan]]])

I want to compute a per-pixel linear regression and obtain R-squared, P-values, intercepts and slopes for each xy pixel in y_array, with values for each timestep in x_array as my independent variable.

I can reshape the data into a form that can be fed to np.polyfit, which is vectorised and fast:

# Reshape so rows = number of time-steps and columns = pixels:
y_array_reshaped = y_array.reshape(len(y_array), -1)

# Do a first-degree polyfit
np.polyfit(x_array, y_array_reshaped, 1)

However, this ignores pixels that contain any NaN values (np.polyfit does not support NaN values), and does not calculate the statistics I require (R-squared, P-values, intercepts and slopes).
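On NaN-free data, the reshape round-trip looks like this (a sketch using stand-in random data; np.polyfit returns one coefficient column per pixel, which can be reshaped back onto the spatial grid):

```python
import numpy as np

x_array = np.array([0.5, 0.2, 0.4, 0.4])

# Stand-in NaN-free data: four time-steps of 3x3 spatial values
rng = np.random.default_rng(42)
y_clean = rng.normal(size=(4, 3, 3))

# Rows = time-steps, columns = pixels
y2d = y_clean.reshape(len(y_clean), -1)

# First-degree polyfit returns shape (2, n_pixels): slope row, then intercept row
coeffs = np.polyfit(x_array, y2d, 1)
slope_map = coeffs[0].reshape(3, 3)
intercept_map = coeffs[1].reshape(3, 3)
```

With NaNs present, the same call fails for every affected column, which is exactly the problem described above.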

The answer here uses linregress from scipy.stats, which does calculate the statistics I need, and suggests avoiding NaN issues by masking out the NaN values. However, that example is for two 1D arrays, and I can't work out how to apply a similar masking approach to my case, where each column in y_array_reshaped has a different set of NaN values.
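For reference, the masked linregress approach can at least be applied pixel-by-pixel in a loop. This is far too slow for 300 x 3000 x 3000 data, but it spells out the computation each pixel needs (a sketch using the sample arrays above):

```python
import numpy as np
from scipy.stats import linregress

x_array = np.array([0.5, 0.2, 0.4, 0.4])
y_array = np.array([[[-0.2,   -0.2,   -0.3],
                     [-0.3,   -0.2,   -0.3],
                     [-0.3,   -0.4,   -0.4]],
                    [[-0.2,   -0.2,   -0.4],
                     [-0.3,   np.nan, -0.3],
                     [-0.3,   -0.3,   -0.4]],
                    [[np.nan, np.nan, -0.3],
                     [-0.2,   -0.3,   -0.7],
                     [-0.3,   -0.3,   -0.3]],
                    [[-0.1,   -0.3,   np.nan],
                     [-0.2,   -0.3,   np.nan],
                     [-0.1,   np.nan, np.nan]]])

ny, nx = y_array.shape[1:]
slope = np.full((ny, nx), np.nan)
rsquared = np.full((ny, nx), np.nan)

for i in range(ny):
    for j in range(nx):
        ts = y_array[:, i, j]
        mask = ~np.isnan(ts)              # drop this pixel's NaN time-steps
        if mask.sum() >= 3:               # need a few points for a meaningful fit
            res = linregress(x_array[mask], ts[mask])
            slope[i, j] = res.slope
            rsquared[i, j] = res.rvalue ** 2
```

Each pixel gets its own mask, which is what the reshaped 2D approach above cannot express in a single vectorised call.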

My question: How can I calculate regression statistics for each pixel in a large multidimensional array (300 x 3000 x 3000) containing many NaN values in a reasonably fast, vectorised way?

Desired result: A 3 x 3 array of regression statistic values (e.g. R-squared) for each pixel in y_array, even if that pixel contains NaN values at some point in the time series.

Solution

This blog post mentioned in the comments above contains an incredibly fast vectorized function for cross-correlation, covariance, and regression for multi-dimensional data in Python. It produces all of the regression outputs I need, and does so in milliseconds as it relies entirely on simple vectorised array operations in xarray.

https://hrishichandanpurkar.blogspot.com/2017/09/vectorized-functions-for-correlation.html

I have made one minor change (first line after #3) to ensure the function correctly accounts for different numbers of NaN values in each pixel:

import numpy as np
import xarray as xr
from scipy.stats import t

def lag_linregress_3D(x, y, lagx=0, lagy=0):
    """
    Input: Two xr.DataArrays of any dimensions with the first dim being time.
    Thus the input data could be a 1D time series, or for example, have three
    dimensions (time, lat, lon).
    Datasets can be provided in any order, but note that the regression slope
    and intercept will be calculated for y with respect to x.
    Output: Covariance, correlation, regression slope and intercept, p-value,
    and standard error on regression between the two datasets along their
    aligned time dimension.
    Lag values can be assigned to either of the data, with lagx shifting x and
    lagy shifting y by the specified lag amount.
    """
    # 1. Ensure that the data are properly aligned with each other.
    x, y = xr.align(x, y)

    # 2. Add lag information if any, and shift the data accordingly.
    if lagx != 0:
        # If x lags y by 1, x must be shifted 1 step backwards.
        # But as the 'zero-th' value is nonexistent, xr assigns it as invalid
        # (NaN). Hence it needs to be dropped.
        x = x.shift(time=-lagx).dropna(dim='time')

        # Next, re-align the two datasets so that y adjusts to the changed
        # coordinates of x.
        x, y = xr.align(x, y)

    if lagy != 0:
        y = y.shift(time=-lagy).dropna(dim='time')
        x, y = xr.align(x, y)

    # 3. Compute data length, mean and standard deviation along the time axis.
    # Counting non-null values gives each pixel its own sample size n, so
    # pixels with different numbers of NaNs are handled correctly.
    n = y.notnull().sum(dim='time')
    xmean = x.mean(axis=0)
    ymean = y.mean(axis=0)
    xstd  = x.std(axis=0)
    ystd  = y.std(axis=0)

    # 4. Compute covariance along the time axis.
    cov = np.sum((x - xmean) * (y - ymean), axis=0) / n

    # 5. Compute correlation along the time axis.
    cor = cov / (xstd * ystd)

    # 6. Compute regression slope and intercept.
    slope     = cov / (xstd ** 2)
    intercept = ymean - xmean * slope

    # 7. Compute p-value and standard error.
    # Use the absolute value of the t-statistic so the two-sided p-value is
    # also correct for negative slopes.
    tstats = cor * np.sqrt(n - 2) / np.sqrt(1 - cor ** 2)
    stderr = slope / tstats

    pval = t.sf(np.abs(tstats), n - 2) * 2
    pval = xr.DataArray(pval, dims=cor.dims, coords=cor.coords)

    return cov, cor, slope, intercept, pval, stderr
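As a cross-check on the formulas above (my addition, not part of the original answer), the same NaN-aware per-pixel slope and intercept can be computed with plain NumPy masked sums, without xarray:

```python
import numpy as np

x_array = np.array([0.5, 0.2, 0.4, 0.4])
y_array = np.array([[[-0.2,   -0.2,   -0.3],
                     [-0.3,   -0.2,   -0.3],
                     [-0.3,   -0.4,   -0.4]],
                    [[-0.2,   -0.2,   -0.4],
                     [-0.3,   np.nan, -0.3],
                     [-0.3,   -0.3,   -0.4]],
                    [[np.nan, np.nan, -0.3],
                     [-0.2,   -0.3,   -0.7],
                     [-0.3,   -0.3,   -0.3]],
                    [[-0.1,   -0.3,   np.nan],
                     [-0.2,   -0.3,   np.nan],
                     [-0.1,   np.nan, np.nan]]])

valid = ~np.isnan(y_array)                 # (time, y, x) mask of usable samples
n = valid.sum(axis=0)                      # per-pixel sample count
xb = np.broadcast_to(x_array[:, None, None], y_array.shape)

# Per-pixel means of x and y over the valid time-steps only
xmean = np.where(valid, xb, 0).sum(axis=0) / n
ymean = np.nansum(y_array, axis=0) / n

# Masked covariance and x-variance, then slope and intercept
dx = np.where(valid, xb - xmean, 0)
dy = np.where(valid, y_array - ymean, 0)
cov = (dx * dy).sum(axis=0) / n
xvar = (dx ** 2).sum(axis=0) / n
slope = cov / xvar
intercept = ymean - slope * xmean
```

This agrees pixel-by-pixel with scipy.stats.linregress applied to each masked time series, which is a useful sanity check for the vectorised function.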
