Number of peaks in histogram

Problem description

I have 1D data representing intensity values. I want to detect the number of components in these data (clusters of points with similar intensity or, alternatively, the number of "peaks" in a histogram created from the data).

The approach described in "1D multiple peak detection?" is not very useful for me, because one "peak" can contain several local maxima (see image below).

Of course, I can use a statistical approach: for example, fit the data with 1, 2, 3, ..., n peaks, compute BIC, AIC or a similar criterion for each fit, and finally use the elbow method to determine the number of clusters. However, I want to detect the approximate number of peaks as fast as possible, and fitting a Gaussian mixture is quite a time-consuming procedure.
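
For reference, the two criteria mentioned above have standard definitions. The block below writes them out for a 1D Gaussian mixture with m components fitted to N points; the symbols m, k, N and \hat{L} (the maximized likelihood) are notation chosen here, not taken from the question.

```latex
% Free parameters of a 1D mixture with m components:
% m means, m variances and m - 1 independent weights.
\[
  k = 3m - 1
\]
\[
  \mathrm{AIC}(m) = 2k - 2\ln\hat{L}_m ,
  \qquad
  \mathrm{BIC}(m) = k\ln N - 2\ln\hat{L}_m
\]
% Lower values are better; each candidate m needs its own fit.
```

Each candidate m requires a complete fit before its criterion can be evaluated, which is where the running time goes.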

My approach

So I came up with the following approach (in C++). It takes the histogram bin heights (y) and searches for the indices at which the y values start to decline. Then peaks lower than the y tolerance (yt) are filtered out. Finally, indices that are closer to a neighboring index than the x tolerance (xt) are filtered out too:

Indices StatUtils::findLocalMaximas(const Points1D &y, int xt, int yt) {

  // Result indices
  Indices indices;

  // Find all local maxima
  int imax = 0;
  double max = y[0];
  bool inc = true;
  bool dec = false;
  for (int i = 1; i < y.size(); i++) {    

    // Changed from decline to increase, reset maximum
    if (dec && y[i - 1] < y[i]) {
      max = std::numeric_limits<double>::lowest();  // min() is the smallest positive double, not the most negative one
      dec = false;
      inc = true;
    }

    // Changed from increase to decline, save index of maximum
    if (inc && y[i - 1] > y[i]) {
       indices.append(imax);
       dec = true;
       inc = false;
    }

    // Update maximum
    if (y[i] > max) {
       max = y[i];
       imax = i;
    }
  }

  // If peak size is too small, ignore it
  int i = 0;
  while (indices.count() >= 1 && i < indices.count()) {
    if (y[indices.at(i)] < yt) {
      indices.removeAt(i);
    } else {
      i++;
    }
  }

  // If two peaks are near to each other, take only the largest one
  i = 1;
  while (indices.count() >= 2 && i < indices.count()) {
    int index1 = indices.at(i - 1);
    int index2 = indices.at(i);
    if (abs(index1 - index2) < xt) {
      indices.removeAt(y[index1] < y[index2] ? i-1 : i);
    } else {
      i++;
    }
  }
  return indices;
}
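
The types Points1D and Indices are not defined in the post (the append/count/removeAt calls suggest Qt-style containers). To try the idea without them, here is a minimal standalone sketch of the same three passes using std::vector. The name findLocalMaxima, the single `rising` flag and the use of double for yt are choices made for this sketch, not part of the original code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdlib>
#include <limits>
#include <vector>

// Sketch only: std::vector stands in for the Points1D and Indices types.
std::vector<int> findLocalMaxima(const std::vector<double>& y, int xt, double yt) {
  assert(xt >= 0);
  std::vector<int> indices;
  if (y.empty()) return indices;

  // Pass 1: each time the sequence turns from rising to falling,
  // store the index of the maximum seen during the rising run.
  int imax = 0;
  double max = y[0];
  bool rising = true;
  for (std::size_t i = 1; i < y.size(); ++i) {
    if (!rising && y[i - 1] < y[i]) {         // falling -> rising: reset maximum
      max = std::numeric_limits<double>::lowest();
      rising = true;
    } else if (rising && y[i - 1] > y[i]) {   // rising -> falling: save maximum
      indices.push_back(imax);
      rising = false;
    }
    if (y[i] > max) {
      max = y[i];
      imax = static_cast<int>(i);
    }
  }

  // Pass 2: drop peaks whose height is below the y tolerance.
  indices.erase(std::remove_if(indices.begin(), indices.end(),
                               [&](int k) { return y[k] < yt; }),
                indices.end());

  // Pass 3: of two neighboring peaks closer than xt, keep only the higher one.
  std::size_t i = 1;
  while (i < indices.size()) {
    const int a = indices[i - 1];
    const int b = indices[i];
    if (std::abs(a - b) < xt) {
      indices.erase(indices.begin() + (y[a] < y[b] ? i - 1 : i));
    } else {
      ++i;
    }
  }
  return indices;
}
```

For the bin heights {1, 3, 2, 1, 4, 6, 5, 1, 2, 1} with xt = 2 and yt = 2.5, the first pass finds candidates at indices 1, 5 and 8, the second pass removes index 8 (height 2), and the result is {1, 5}. Raising xt to 5 merges the two remaining peaks and leaves only index 5. Like the original, the sketch does not report a peak when the histogram is still rising at its last bin.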

Problem with the approach

The problem with this solution is that it depends strongly on the tolerance values (xt and yt), so I need to know the minimum allowed distance between peaks in advance. Moreover, my data contain isolated outliers that are higher than the maxima of the smaller peaks.

Could you suggest another approach for determining the number of peaks in data similar to those in the attached figure?

Solution

You could use my method of approximate Gaussian mixtures:

  • it is a robust statistical method

  • it does not depend on absolute thresholds; it only has two parameters, both relative (normalized) quantities that are easily controlled, and the same values apply to different datasets

  • unlike the elbow method and most statistical methods, it estimates the number of modes dynamically in a single EM (expectation-maximization) run. It starts with every data point as an independent mode and deletes "overlapping" modes at every iteration.

  • it is fast because it employs approximate nearest neighbor (ANN) search at each iteration and its updates take into account only the k nearest neighbors, not all data points.
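
As a rough illustration of the idea in the third bullet (start with one mode per data point, then delete overlapping modes at every iteration), here is a simplified 1D sketch. It is not the implementation referred to in this answer: it computes exact responsibilities over all points instead of using ANN search, and the overlap test as well as the names estimateModes, sigma0 and tau are assumptions made purely for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

struct Mode { double weight, mean, var; };

// Illustration only: plain EM for a 1D Gaussian mixture plus a purge step.
// Two modes count as overlapping when |m_i - m_j| < tau * (s_i + s_j),
// which is a hypothetical criterion chosen for this sketch.
std::vector<Mode> estimateModes(const std::vector<double>& x, double sigma0,
                                double tau, int iterations) {
  assert(!x.empty() && sigma0 > 0.0);
  const double n = static_cast<double>(x.size());
  std::vector<Mode> modes;
  for (double v : x) modes.push_back({1.0 / n, v, sigma0 * sigma0});

  for (int it = 0; it < iterations; ++it) {
    const std::size_t k = modes.size();
    std::vector<double> r(k), sumR(k, 0.0), sumRx(k, 0.0), sumRxx(k, 0.0);

    // E-step: responsibility of every mode for every point.
    for (double v : x) {
      double total = 0.0;
      for (std::size_t j = 0; j < k; ++j) {
        const double d = v - modes[j].mean;
        r[j] = modes[j].weight / std::sqrt(modes[j].var) *
               std::exp(-0.5 * d * d / modes[j].var);
        total += r[j];
      }
      if (total <= 0.0) continue;  // point is numerically far from all modes
      for (std::size_t j = 0; j < k; ++j) {
        const double w = r[j] / total;
        sumR[j] += w;
        sumRx[j] += w * v;
        sumRxx[j] += w * v * v;
      }
    }

    // M-step: update weight, mean and variance (with a small variance floor).
    for (std::size_t j = 0; j < k; ++j) {
      if (sumR[j] <= 0.0) { modes[j].weight = 0.0; continue; }
      modes[j].weight = sumR[j] / n;
      modes[j].mean = sumRx[j] / sumR[j];
      modes[j].var = std::max(
          sumRxx[j] / sumR[j] - modes[j].mean * modes[j].mean, 1e-6);
    }

    // Purge: visit modes from heaviest to lightest and keep a mode only
    // if it does not overlap a mode that has already been kept.
    std::sort(modes.begin(), modes.end(),
              [](const Mode& a, const Mode& b) { return a.weight > b.weight; });
    std::vector<Mode> kept;
    for (const Mode& m : modes) {
      if (m.weight <= 0.0) continue;
      bool overlaps = false;
      for (const Mode& q : kept) {
        if (std::fabs(m.mean - q.mean) <
            tau * (std::sqrt(m.var) + std::sqrt(q.var))) {
          overlaps = true;
        }
      }
      if (!overlaps) kept.push_back(m);
    }
    if (kept.empty()) break;  // degenerate input; keep the previous modes
    double wsum = 0.0;
    for (const Mode& m : kept) wsum += m.weight;
    for (Mode& m : kept) m.weight /= wsum;
    modes = kept;
  }
  return modes;
}
```

Called on ten points clustered around 0 and around 10 with sigma0 = 1, tau = 1 and ten iterations, the sketch collapses the ten initial modes to two, one per cluster. The number of surviving modes is the estimated number of peaks, and no separate fit per candidate count is needed.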

There is an online Matlab demo so you can easily experiment on a small dataset. In our C++ implementation we use FLANN for nearest neighbor search at large scale. Unfortunately this implementation is not public but I could give you some version if you're interested.
