How to select a uniformly distributed subset of a partially dense dataset?


Question

P is an n*d matrix, holding n d-dimensional samples. In some areas, P is several times denser than in others. I want to select a subset of P in which the distance between any pair of samples is more than d0, and I need it to be spread over the whole area. All samples have the same priority, and there's no need to optimize anything (e.g. covered area or sum of pairwise distances).

Here is sample code that does this, but it's really slow. I need more efficient code since I need to call it many times.

%% generating sample data
n_4 = 1000; n_2 = n_4*2; n = n_4*4;
x1=[ randn(n_4, 1)*10+30; randn(n_4, 1)*3 + 60];
y1=[ randn(n_4, 1)*5 + 35; randn(n_4, 1)*20 + 80 ];
x2 = rand(n_2, 1)*(max(x1)-min(x1)) + min(x1);
y2 = rand(n_2, 1)*(max(y1)-min(y1)) + min(y1);
P = [x1,y1;x2, y2];
%% eliminating close ones
tic
d0 = 1.5;
D = pdist2(P, P);
D(1:n+1:end) = inf; % ignore self-distances
E = zeros(n, 1); % eliminated ones
for i=1:n-1
    if ~E(i)
        CloseOnes = (D(i,:)<d0) & ((1:n)>i) & (~E');
        E(CloseOnes) = 1;
    end
end
P2 = P(~E, :);
toc
%% plotting samples
subplot(121); scatter(P(:, 1), P(:, 2)); axis equal;
subplot(122); scatter(P2(:, 1), P2(:, 2)); axis equal;
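The `pdist2` call above builds the full n-by-n distance matrix, which dominates the cost. For reference, here is a rough sketch of the same greedy elimination in Python using a k-d tree (`scipy.spatial.cKDTree`), so that only points within `d0` of each kept point are ever examined; this is an illustration of the idea, not a drop-in replacement for the MATLAB code:

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)

# Sample data in the same spirit as the MATLAB snippet:
# two dense Gaussian blobs plus a uniform background.
n4 = 1000
x1 = np.concatenate([rng.normal(30, 10, n4), rng.normal(60, 3, n4)])
y1 = np.concatenate([rng.normal(35, 5, n4), rng.normal(80, 20, n4)])
x2 = rng.uniform(x1.min(), x1.max(), 2 * n4)
y2 = rng.uniform(y1.min(), y1.max(), 2 * n4)
P = np.column_stack([np.concatenate([x1, x2]), np.concatenate([y1, y2])])

def thin(P, d0):
    """Greedy thinning: keep a point, drop all later points within d0 of it."""
    tree = cKDTree(P)
    keep = np.ones(len(P), dtype=bool)
    for i in range(len(P)):
        if not keep[i]:
            continue
        # Eliminate every later neighbour closer than d0, as in the MATLAB loop.
        for j in tree.query_ball_point(P[i], d0):
            if j > i:
                keep[j] = False
    return P[keep]

P2 = thin(P, 1.5)
```

Because the nearest kept predecessor of any surviving point is at least `d0` away, the result satisfies the same pairwise-distance guarantee as the matrix-based loop.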

How big should the subset be?

As j_random_hacker pointed out in the comments, one could say that P(1, :) is the fastest answer if we don't define a constraint on the number of selected samples. It delicately shows the incoherence of the title! But I think the current title describes the purpose better. So let's define a constraint: "Try to select m samples if it's possible". Now, with the implicit assumption of m=n, we can get the biggest possible subset. As I mentioned before, a faster method beats one that finds the optimal answer.

Answer

Finding the closest points over and over suggests a different data structure, one optimized for spatial searches. I suggest a Delaunay triangulation.

The solution below is "approximate" in the sense that it will likely remove more points than strictly necessary. I batch all the computations and, in each iteration, remove every point that contributes to a distance that is too short; in many cases, removing one point would already have removed an edge that appears later in the same iteration. If this matters, the edge list can be processed further to avoid duplicates, or even to find the points whose removal affects the greatest number of distances.

It's very fast.

dt = delaunayTriangulation(P(:,1), P(:,2));
d0 = 1.5;

while 1
    edge = edges(dt);  % vertex ids in pairs

    % Lookup the actual locations of each point and reorganize
    pwise = reshape(dt.Points(edge.', :), 2, size(edge,1), 2);
    % Compute length of each edge
    difference = pwise(1,:,:) - pwise(2,:,:);
    edge_lengths = sqrt(difference(1,:,1).^2 + difference(1,:,2).^2);

    % Find edges less than minimum length
    idx = find(edge_lengths < d0);
    if(isempty(idx))
        break;
    end

    % pick first vertex of each too-short edge for deletion
    % This could be smarter to avoid overdeleting
    points_to_delete = unique(edge(idx, 1));

    % remove them.  triangulation auto-updates
    dt.Points(points_to_delete, :) = [];

    % repeat until no edge is too short
end

P2 = dt.Points;
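As a rough cross-check of the same idea outside MATLAB, here is a Python/SciPy sketch. Note the assumption baked in: `scipy.spatial.Delaunay` has no incremental update, so unlike `delaunayTriangulation` the triangulation is rebuilt on every pass, which changes the cost profile:

```python
import numpy as np
from scipy.spatial import Delaunay

def thin_delaunay(P, d0):
    """Iteratively drop the first endpoint of every Delaunay edge shorter
    than d0, re-triangulate, and repeat until no edge is too short."""
    P = np.asarray(P, dtype=float)
    while True:
        tri = Delaunay(P)
        # Collect the unique undirected edges from the triangles.
        e = np.vstack([tri.simplices[:, [0, 1]],
                       tri.simplices[:, [1, 2]],
                       tri.simplices[:, [0, 2]]])
        e = np.unique(np.sort(e, axis=1), axis=0)
        lengths = np.linalg.norm(P[e[:, 0]] - P[e[:, 1]], axis=1)
        short = lengths < d0
        if not short.any():
            return P
        # Mirror the MATLAB code: delete the first vertex of each short edge.
        P = np.delete(P, np.unique(e[short, 0]), axis=0)

rng = np.random.default_rng(1)
P = rng.uniform(0, 50, size=(500, 2))
P2 = thin_delaunay(P, 1.5)
```

Since the Delaunay graph contains the nearest-neighbour graph, "no Delaunay edge shorter than d0" implies the full pairwise-distance guarantee, the same property the MATLAB loop relies on.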
