Algorithm to detect a linear behaviour in a data set

Question

Some time ago I posted a question about an algorithm to make a polynomial fit of part of a data set, and received several suggestions for doing what I wanted. Now that I am trying to apply the ideas from those answers, I face another problem. My goal is to find the best linear fit of a data set in which only part of the data is linear.

Here is an example of what I must do:

We have these two data sets, and I must fit a linear trend to the linear part of the data that lies to the left of the dashed line. In red is the ideal data set, which is linear from the beginning up to the dashed line. In blue is the 'problematic' data set, which has a plateau. The bold part is the part that I have to use for the linear fit.

My problem is that I tried to do as mentioned in the question linked above: I computed the second-order derivative of the smoothed data and looked for where it was no longer 'close enough' to 0. Here are my results for the problematic data set (first image) and for the ideal data set (second image):

(Sorry for the quality, I don't know why it is so blurred.) On both images I plotted the first-order derivative and, in red, the second-order derivative. On the first image we see peaks in the second derivative. The problem is that these peaks are not very 'high', which makes it difficult to establish a threshold that would tell whether the set is linear or not. In contrast, the peak of the first derivative is quite high, which makes it easy to see visually.

I thought it would be enough to calculate the mean of the first-derivative values and look for where the values differ too much from that mean. But when I take the mean of the first-derivative values in order to see where they deviate from it, there is a sort of offset caused by the peak.

How can I efficiently remove this offset so that the mean is taken only over the data to the right of the peak? (The data to the left of the discontinuity seen in the first image could be nonlinear, or linear but with a different value from the data on the right.)

Recommended answer

The mean operator (as you have noticed) is very sensitive to outliers (peaks). You may wish to use a more robust estimator, such as the median or the x-percentile of the values (which should be more appropriate for your case).
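As a rough illustration of this suggestion, here is a minimal Python sketch using NumPy. The function name `robust_linear_mask`, the parameters `q` and `tol`, and the synthetic data are all assumptions made for the example and do not come from the question. It estimates the first derivative, takes a percentile of it as the reference slope (`q=50` is the median), and keeps only the samples whose slope stays within `tol` of that reference.

```python
import numpy as np

def robust_linear_mask(x, y, q=50.0, tol=0.1):
    """Hypothetical helper: flag samples whose local slope is close to a
    robust reference slope (the q-th percentile; q=50 is the median)."""
    slope = np.gradient(y, x)              # first-derivative estimate
    center = np.percentile(slope, q)       # robust to a few large peaks
    mask = np.abs(slope - center) <= tol   # tol is an assumed tolerance
    return mask, center

# Synthetic data (an assumption for illustration): a segment with slope 0.5,
# a jump at x = 30, then the linear part of interest with slope 2.
x = np.arange(100, dtype=float)
y = np.where(x < 30, 0.5 * x, 2.0 * x - 25.0)

slope = np.gradient(y, x)
mask, center = robust_linear_mask(x, y)

# Fit a line only to the samples flagged as belonging to the linear part.
fit_slope, fit_intercept = np.polyfit(x[mask], y[mask], 1)

print("mean slope:  ", slope.mean())
print("median slope:", center)
print("fitted line: ", fit_slope, fit_intercept)
```

In this synthetic case the mean of the derivative is pulled away from the true slope by the peak and by the left-hand segment, whereas the median is not. With real, noisy data the tolerance would need tuning, and a percentile other than 50 may work better if the linear part covers less than half of the samples.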
