Efficiently calculating a segmented regression on a large dataset

Question

I currently have a large data set for which I need to calculate a segmented regression (or fit a piecewise linear function in some similar way). However, I have both a large data set and a very large number of pieces.

Currently I have the following approach:

  • Let s_i be the end of segment i
  • Let (x_i, y_i) denote the i-th data point

Assume the data point x_k lies within segment j. Then I can create a vector from x_k as

(s_1, s_2 - s_1, s_3 - s_2, ..., x_k - s_{j-1}, 0, 0, ...)

To do a segmented regression on the data points, I can do an ordinary linear regression on these vectors.
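The construction above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function names `design_matrix` and `fit_continuous_piecewise` are hypothetical, an explicit left edge s_0 is added to the boundary list (the vector in the question implicitly uses s_0 = 0), and an intercept column is added so that the fitted function has a free value at s_0. The matrix is dense, so this form is only practical for small problems.

```python
import numpy as np

def design_matrix(x, breaks):
    """Row k holds how much of each segment lies to the left of x[k].

    breaks = [s_0, s_1, ..., s_m] are the known segment boundaries
    (s_0 is an assumed explicit left edge).
    """
    x = np.asarray(x, dtype=float)
    breaks = np.asarray(breaks, dtype=float)
    widths = np.diff(breaks)  # s_i - s_{i-1}
    # Full width for segments left of x, partial width for the segment
    # containing x, zero for segments to the right of x.
    return np.clip(x[:, None] - breaks[:-1], 0.0, widths)

def fit_continuous_piecewise(x, y, breaks):
    """Ordinary least squares on the vectors; returns (value at s_0, slopes)."""
    A = design_matrix(x, breaks)
    A = np.hstack([np.ones((len(A), 1)), A])  # assumed intercept column
    coef, *_ = np.linalg.lstsq(A, np.asarray(y, dtype=float), rcond=None)
    return coef[0], coef[1:]

# Small noise-free demo: three segments with known slopes.
breaks = [0.0, 1.0, 3.0, 4.0]
x = np.array([0.2, 0.7, 1.5, 2.5, 3.2, 3.9])
true_slopes = np.array([2.0, -1.0, 0.5])
y = 1.0 + design_matrix(x, breaks) @ true_slopes
y0, slopes = fit_continuous_piecewise(x, y, breaks)
```

Because every coefficient is a per-segment slope and the segment contributions are summed, the fitted segments meet at the boundaries by construction.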

However, my current estimates show that if I define the problem that way, I will get about 600,000 vectors with about 2,000 components each. I haven't benchmarked yet, but I don't think my computer will be able to solve such a large regression problem in any acceptable time.
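For scale: a dense 600,000 × 2,000 matrix of 8-byte floats needs roughly 9.6 GB. One possible way to shrink the problem, which is not part of the original question or answer, is to change the unknowns from segment slopes to the heights of the function at the segment boundaries. Each data point then depends on only the two boundaries of its own segment, so the normal equations are tridiagonal and can be accumulated in a single pass over the data. The sketch below assumes known boundaries, at least enough points per segment for the system to be non-singular, and a hypothetical function name `fit_node_values`; it uses a dense solve for brevity, where a banded solver would be the natural refinement.

```python
import numpy as np

def fit_node_values(x, y, breaks):
    """Least-squares heights at the boundaries of a continuous piecewise-linear fit."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    breaks = np.asarray(breaks, dtype=float)
    m = len(breaks) - 1  # number of segments
    # Index of the segment containing each point (points on the last
    # boundary are assigned to the last segment).
    j = np.clip(np.searchsorted(breaks, x, side="right") - 1, 0, m - 1)
    t = (x - breaks[j]) / (breaks[j + 1] - breaks[j])  # position in segment, 0..1
    w_left, w_right = 1.0 - t, t  # interpolation weights of the two boundaries

    # Accumulate A^T A and A^T y without ever forming A.
    ata = np.zeros((m + 1, m + 1))
    aty = np.zeros(m + 1)
    np.add.at(ata, (j, j), w_left * w_left)
    np.add.at(ata, (j + 1, j + 1), w_right * w_right)
    np.add.at(ata, (j, j + 1), w_left * w_right)
    np.add.at(ata, (j + 1, j), w_left * w_right)
    np.add.at(aty, j, w_left * y)
    np.add.at(aty, j + 1, w_right * y)
    return np.linalg.solve(ata, aty)

# Small noise-free demo: the fit should reproduce the boundary heights.
demo_breaks = [0.0, 1.0, 3.0, 4.0]
demo_nodes = np.array([1.0, 3.0, 1.0, 1.5])
demo_x = np.array([0.2, 0.7, 1.5, 2.5, 3.2, 3.9])
demo_y = np.interp(demo_x, demo_breaks, demo_nodes)
fitted_nodes = fit_node_values(demo_x, demo_y, demo_breaks)
```

With about 2,000 segments the accumulated system is only about 2,001 × 2,001, regardless of the number of data points.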

Is there a better way to calculate this kind of regression problem? One idea was to use some kind of hierarchical approach, i.e. calculate one regression problem by combining multiple segments, so that I can determine start and end points for this set, and then calculate an individual segmented regression for this set of segments. However, I cannot figure out how to calculate the regression for this set of segments so that the endpoints match (I can only match the start point or the end point by fixing the intercept, but not both).

Another idea I had was to calculate an individual regression for each of the segments and then only use the slope for that segment. However, with that approach errors might start to accumulate, and I have no way to control this kind of error accumulation.

Yet another idea is that I could do an individual regression for each segment, but fix the intercept to the endpoint of the previous segment. However, I am still not sure whether I would get some kind of error accumulation this way.
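The chained idea can be sketched as follows, mainly to make the dependency between segments visible. The function name `chained_fit` and the requirement to supply a starting height `y_start` are assumptions made for this illustration. For a line forced through a fixed point (x0, y0), the least-squares slope is sum((x - x0)(y - y0)) / sum((x - x0)^2).

```python
import numpy as np

def chained_fit(x, y, breaks, y_start):
    """Fit segments left to right, each forced through the previous endpoint."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    nodes = [float(y_start)]  # heights at the segment boundaries
    for left, right in zip(breaks[:-1], breaks[1:]):
        in_seg = (x >= left) & (x <= right)
        dx = x[in_seg] - left
        dy = y[in_seg] - nodes[-1]  # measured from the previous endpoint
        denom = np.dot(dx, dx)
        # Least-squares slope of a line through (left, nodes[-1]);
        # falls back to a flat segment if no usable points exist.
        slope = np.dot(dx, dy) / denom if denom > 0 else 0.0
        nodes.append(nodes[-1] + slope * (right - left))
    return np.array(nodes)

# Noise-free demo: slopes 2 and -1 starting from height 1.
chain_breaks = [0.0, 1.0, 3.0]
chain_x = np.array([0.25, 0.75, 1.5, 2.5])
chain_y = np.array([1.5, 2.5, 2.5, 1.5])
chain_nodes = chained_fit(chain_x, chain_y, chain_breaks, y_start=1.0)
```

Each endpoint is computed from the one before it, so an error in an early endpoint shifts the starting height of every later segment; nothing in this scheme corrects it afterwards.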

Clarification

I am not sure whether this was clear from the rest of the question: I know where the segments start and end. The most important part is that each line segment has to meet the next segment at the segment boundary.

Edit

Maybe another fact that could help: all points have different x values.

Recommended answer

I would group the points into rectangular grid areas based on their position. That way you process this task on several smaller datasets and then merge the results together when all are done.

I would process each group like this:

  1. Compute a histogram of angles.
  2. Take only the most frequently occurring angles. Their count determines the number of line segments present in the group.
  3. Do a regression/line fit for these angles. See this Answer; it does something very similar (just a single line).
  4. Compute the intersection points between the line segments to get the endpoints of your piecewise polyline, and also the connectivity info (join the closest endpoints).

[edit1] After the OP's edit

You know the edge x coordinates of all segments (x0, x1, ...), so just compute the average y coordinate of the points near each segment edge (gray area, green points) and you have the segment line endpoints (blue points). Of course, this is not a fit or a regression, because it discards all the other points, so it leads to bigger errors (unless the segment x coordinates correspond to the regressed lines ...), but there is no way around it with the constraints your solution has (at least I do not see any).

Because if you use regression on the segment data, you cannot connect it to the other segments, and if you try to merge them, you get almost the same result as this:

The size of the gray area determines the output, so play with it a bit.
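The edge-averaging idea above might look roughly like this in NumPy. It is a minimal sketch, assuming the gray area is a symmetric window of half-width `half_width` around each edge; that parameter and the function name `edge_averages` are hypothetical and not taken from the answer. Sorting once and using prefix sums keeps the cost near O(n log n) even with thousands of edges.

```python
import numpy as np

def edge_averages(x, y, edges, half_width):
    """Mean y of the points within half_width of each segment edge."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    edges = np.asarray(edges, dtype=float)
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    csum = np.concatenate([[0.0], np.cumsum(ys)])  # prefix sums of y
    lo = np.searchsorted(xs, edges - half_width, side="left")
    hi = np.searchsorted(xs, edges + half_width, side="right")
    counts = hi - lo
    sums = csum[hi] - csum[lo]
    # Edges with no nearby points are reported as NaN.
    out = np.full(len(edges), np.nan)
    np.divide(sums, counts, out=out, where=counts > 0)
    return out

# Small demo with two points near each of three edges.
pts_x = np.array([0.0, 0.1, 0.9, 1.1, 1.9, 2.0])
pts_y = np.array([1.0, 3.0, 4.0, 6.0, 10.0, 12.0])
endpoints = edge_averages(pts_x, pts_y, edges=[0.0, 1.0, 2.0], half_width=0.25)
```

Joining consecutive entries of the result with straight lines gives a polyline that is continuous at every edge. A window that is too wide smooths over slope changes, and one that is too narrow averages very few points, which is the trade-off the answer refers to.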
