Removing repetitions removes too many points

Question

I am trying to remove repetitive data from a set of key-value pairs. The repetitions either have exactly the same key or have keys that are very close to each other. In those cases I only want to keep the key-value pair with the largest value.

The code from this answer

ind=-1;
while(~isempty(ind))
  %find the non-max point
  Max=([diff(vals) 0]<0 & [0 -diff(vals)]<0); 
  Nind=1:length(vals);
  Nind(Max)=[];

  %determine the range of points
  Cind=[0 diff(keys)<0.5 & abs(diff(keys)>0.01)];
  Cind(find(Cind)-1)=1;
  vec=1:length(Cind);
  Cind=Cind.*vec;
  Cind(Cind == 0)=[];

  %check through & back
  ind=intersect(Cind,Nind);
  keys(ind)=[];
  vals(ind)=[];
end

works for the given pair of arrays

keys = [1 2 3 3.1 3.15 4 5];
vals = [0.8 1 1.1 1.3 1.2 1 1.1];

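So when the input looks like

(plot of the input points, not preserved)

the output looks like this

(plot of the output points, not preserved)

with the repetitions around 3 removed.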

However, if I apply the same solution to the set

keys = [414 414 999 1011 1070 1280 1280 1635 1641 1793 1799 1870 1872 1886 2213 2214 2225 2572 3778 3790 4970];
vals = [1.100 1.100 0.316 0.198 0.224 0.555 0.555 0.443 0.374 0.387 0.510 0.446 0.456 0.347 0.224 0.229 0.171 0.175 0.202 0.183 0.147];

and change the threshold accordingly to

Cind=[0 diff(keys)<13 & abs(diff(keys)>0.01)];

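then the input looks like

(plot of the input points, not preserved)

and the output looks like

(plot of the output points, not preserved)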

The problem in this case is that too many points are removed. For example, in the red circle the largest point in the group is removed, and of the three points in that region only one is kept, although the distances are well above the set threshold of 13. The point at 1635 is also removed, although all larger values are more than 13 away.
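
One way to check the claims above is to list every pair of neighbouring keys that falls under the threshold. This is only a diagnostic sketch, assuming the keys are sorted in ascending order; the names `tol`, `gaps`, `closeIdx` and `pairs` are made up for illustration and do not appear in the original code.

```matlab
keys = [414 414 999 1011 1070 1280 1280 1635 1641 1793 1799 1870 1872 1886 2213 2214 2225 2572 3778 3790 4970];
vals = [1.100 1.100 0.316 0.198 0.224 0.555 0.555 0.443 0.374 0.387 0.510 0.446 0.456 0.347 0.224 0.229 0.171 0.175 0.202 0.183 0.147];

tol = 13;                      % threshold used in the question (assumed name)
gaps = diff(keys);             % distance between neighbouring keys
closeIdx = find(gaps < tol);   % left index of every close pair

% One row per close pair: left key, right key, left value, right value
pairs = [keys(closeIdx); keys(closeIdx+1); vals(closeIdx); vals(closeIdx+1)].';
disp(pairs)
```

For this data, 1635 appears only in the pair (1635, 1641), where it holds the larger value, and 1872 and 1886 are 14 apart, so under the stated rule neither 1635 nor 1886 should be dropped.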

What am I misunderstanding here?

The desired output is this: among key-value pairs whose keys are very close to each other, only the pair with the largest value is kept and the others are removed from both arrays. I marked the points that should be merged into the largest value in this plot:

(plot not preserved)

Edit 2: The desired output arrays would therefore be:

keys = [414 999 1070 1280 1635 1799 1872 1886 2213 2225 2572 3778 4970];
vals = [1.100 0.316 0.224 0.555 0.443 0.510 0.456 0.347 0.224 0.171 0.175 0.202 0.147];
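
To see exactly which points this target drops, the input keys can be compared against it. The snippet below is a small sketch for that comparison only; `wantKeys`, `dropped` and `loc` are hypothetical names introduced here.

```matlab
keys = [414 414 999 1011 1070 1280 1280 1635 1641 1793 1799 1870 1872 1886 2213 2214 2225 2572 3778 3790 4970];
vals = [1.100 1.100 0.316 0.198 0.224 0.555 0.555 0.443 0.374 0.387 0.510 0.446 0.456 0.347 0.224 0.229 0.171 0.175 0.202 0.183 0.147];
wantKeys = [414 999 1070 1280 1635 1799 1872 1886 2213 2225 2572 3778 4970];

% Keys present in the input but absent from the desired output
dropped = setdiff(keys, wantKeys);

% Value of each dropped key (ismember returns the first matching index)
[~, loc] = ismember(dropped, keys);
disp([dropped; vals(loc)].')
```

In this listing 2214 (value 0.229) is dropped while its neighbour 2213 (value 0.224) is kept, which does not match the keep-the-largest rule, so the hand-written target may contain a slip at that point.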

Recommended answer

Here is a straightforward, fairly simple strategy that uses only a few if statements and deletes one point at a time, but it works.

However, the following code has O(N^2) complexity and is not vectorized, so it becomes very time-consuming when the input is large.

%% Input
clc; clear;
keys = [414 414 999 1011 1070 1280 1280 1635 1641 1793 1799 1870 1872 1886 2213 2214 2225 2572 3778 3790 4970];
vals = [1.100 1.100 0.316 0.198 0.224 0.555 0.555 0.443 0.374 0.387 0.510 0.446 0.456 0.347 0.224 0.229 0.171 0.175 0.202 0.183 0.147];

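%% Dealing
% len:  largest key distance at which two neighbours count as repetitions
% flag: true while the previous pass still removed a point
[len,flag]=deal(13,1);
while flag
  flag=0;
  for ii=2:length(keys)
    if keys(ii)-keys(ii-1) > len
      continue;   % neighbours are far enough apart, keep both
    end
    % neighbours are close: drop the one with the smaller value
    % (on a tie the later point is dropped)
    if vals(ii) > vals(ii-1)
      keys(ii-1)=[];
      vals(ii-1)=[];
    else
      keys(ii)=[];
      vals(ii)=[];
    end
    flag=1;
    break;        % indices have shifted, restart the scan
  end
end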

%% Plot
figure(1)
plot(keys,vals)          % line through the remaining points
hold on
plot(keys,vals,'ro')     % mark each remaining point
for ii=1:length(vals)
  text(keys(ii),vals(ii),num2str(ii))   % label each point with its index
end

The code outputs:

(plot of the remaining points, not preserved)
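
Regarding the O(N^2) cost mentioned above: if the keys are sorted in ascending order and "close" is read as a chain of neighbouring gaps no larger than `len`, a single pass with a logical mask may be enough. This is a sketch under those assumptions, not the answer's method, and the names `keep` and `best` are made up for illustration.

```matlab
keys = [414 414 999 1011 1070 1280 1280 1635 1641 1793 1799 1870 1872 1886 2213 2214 2225 2572 3778 3790 4970];
vals = [1.100 1.100 0.316 0.198 0.224 0.555 0.555 0.443 0.374 0.387 0.510 0.446 0.456 0.347 0.224 0.229 0.171 0.175 0.202 0.183 0.147];

len  = 13;                  % same threshold as in the answer
keep = true(size(keys));    % mask of points that survive
best = 1;                   % index of the largest value in the current group

for ii = 2:numel(keys)
  if keys(ii) - keys(ii-1) > len
    best = ii;              % gap is large: a new group starts here
  elseif vals(ii) > vals(best)
    keep(best) = false;     % new point beats the current best
    best = ii;
  else
    keep(ii) = false;       % current best stays (ties keep the earlier point)
  end
end

newKeys = keys(keep);
newVals = vals(keep);
```

For the arrays in the question this yields newKeys = [414 999 1070 1280 1635 1799 1872 1886 2214 2572 3778 4970], the same twelve points the while loop leaves. The two approaches can differ in general: the loop re-measures distances after each deletion, so removing a small middle point can split a chain into two survivors, whereas this sketch keeps one point per chain of the original gaps.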
