在MATLAB时间戳过滤器的优化 - 具有非常大的数据集工作 [英] Optimization of timestamp filter in MATLAB - Working with very large datasets

查看:147
本文介绍了在MATLAB时间戳过滤器的优化 - 具有非常大的数据集工作的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我写在MATLAB程序(必须使用MATLAB,并不能真正使用MEX)来过滤非常大的数据。

一个我需要实现过滤器需要我一​​个时间戳向量比较与周围不能出现其他时间戳的著名的坏的一个时间列表。

一个典型的时间戳向量约2000000个条目,我有30万左右的列表不好的时候。

下面是一个工作的例子,如果 TIME = [1,2.3,5.5,9.1,10]; BAD_TIMES = [5.2,9.3 ]; ,我们有一个宽容容差= 0.25; ,然后在 TIME所有时间戳> 和 9.05和9.55 必须被删除。这意味着,清理矢量 TIME_CLEAN 应该等于 TIME_CLEAN = [1,2.3,5.5,10];

这问题很简单解决的,我已经在大约4或5个不同的方式解决它。然而,对于1,000,000时间戳工作,这个问题可以很容易地花一个小时来计算。

我期待解决此类问题在不到2分钟的典型睿i7处理器的工作站上此过滤器是可行的这种大量的时间条目。

我已经包含了这个code的工作版本。我明白code矢量化和功能,如 bsxfun()可以提供帮助,但这种改善是相对的效率我需要这个过滤器的类型边际。

是否有非常聪明的方式在一个非常有效的方式来解决这个问题?任何帮助将大大AP preciated。

P.S。在code以下是完整的;它产生需要设置问题的所有数据和解决它(虽然很慢!)。更改 NO_OF_TIMESTAMPS 变量更大的东西(比如1,000,000)看它爬!

 清除所有%%清除工作区
关闭所有%%关闭图形
CLC %%清除命令窗口NO_OF_TIMESTAMPS = 10000;时间戳在原始数据%% NUMBER容差= 2; %%公差AROUND TIMESTAMPA =排序(兰迪(NO_OF_TIMESTAMPS / 10,NO_OF_TIMESTAMPS,1)); %%人工生成戳B =唯一的(排序(圆(兰迪([NO_OF_TIMESTAMPS / 2,NO_OF_TIMESTAMPS * 5],[NO_OF_TIMESTAMPS / 10,1])/ 10))); %%会产生不良的时间戳列表人工B_LB = B-容忍; %% CREATE介于LOWERBOUND BAD时间戳列表
B_UB = B +耐受性; %% CREATE UPPERBPUND BAD时间戳列表
B_RANGE = [B_LB B_UB] %%矩阵增广向量B_LB和B_UB组成A_ROWS =尺寸(A,1);期一%%大小;B_ROWS =尺寸(B,1); B对%%大小;抽动; %%启动定时器A_TO_CLEAN =那些(A_ROWS,1); %%布尔VECTOR了将使用的滤波
第II = 1:A_ROWS    对于JJ = 1:B_ROWS        如果A(II)> = B_RANGE(JJ,1)及和放大器; A(II)< = B_RANGE(JJ,2)%%检查的每个成员VERSUS B_RANGE的每个成员           A_TO_CLEAN(II)= 0; %% SET指数VECTOR A_TO_CLEAN = 0,使我们可以在以后删除           打破; %% A(II)只能使用一次删除,所以BREAK JJ循环,进入下一II        结束    结束结束CLEAN = A(~~ A_TO_CLEAN); %%删除经逻辑分度TOC; %%结束定时器clearvars -except一个B_RANGE CLEAN %%只显示相关的变量


解决方案

诀窍使这一高效首先两个向量进行排序。然后通过载体之一创建一个简单的循环,同时保持的索引描述的最接近元素的第二向量。也就是说,你将有类似

 为IX1 = 1:长度(时间戳)
    而(倒霉透(IX2)LT;时间戳(IX1)
        IX2 = IX2 + 1;
    结束
    %检查时间戳(IX1)针对这个邮件(IX2),也许这个邮件(IX2 + 1)和这个邮件(IX2 - 1)
结束

排序是比较有效的,特别是使用内置的插件。现在你只需要一个循环。

这现在具有的一个一个相似的部分归并排序算法

I am writing a program in MATLAB (must use MATLAB and can't really use a MEX) to filter very large amounts of data.

One of the filters I need to implement requires me to compare a timestamp vector versus a list of known "bad" times around which other timestamps cannot occur.

A typical timestamp vector has about 2,000,000 entries, and I have a list of about 300,000 "bad times."

Here's an working example, if TIME=[1, 2.3, 5.5, 9.1, 10];, and BAD_TIMES=[5.2, 9.3];, and we have a tolerance tolerance=0.25;, then all timestamps in TIME between 4.95 and 5.45 and 9.05 and 9.55 must be erased. This means that the cleaned vector TIME_CLEAN should be equal to TIME_CLEAN=[1, 2.3, 5.5, 10];.

This problem is straightforward to solve, and I have solved it in about 4 or 5 different ways. However, for a 1,000,000 timestamp job, this problem can easily take an hour to compute.

I am looking to solve this type of problem in under 2 minutes on a typical Core-i7 workstation for this filter to be viable with this many time entries.

I have included a working version of this code. I understand code vectorization and functions such as bsxfun() can help, but the improvement is marginal relative to the type of efficiency I require for this filter.

Are there any very clever ways to solve this problem in a very efficient fashion? Any help would be greatly appreciated.

P.S. The code below is complete; it generates all data needed to setup the problem and solves it (although VERY slowly!). Change the NO_OF_TIMESTAMPS variable to something larger (such as 1,000,000) to watch it crawl!

clear all %% CLEAR WORKSPACE
close all %% CLOSE FIGURES
clc %% CLEAR COMMAND WINDOW

NO_OF_TIMESTAMPS=10000; %% NUMBER OF TIMESTAMPS IN ORIGINAL DATA

TOLERANCE=2; %% TOLERANCE AROUND TIMESTAMP

A=sort(randi(NO_OF_TIMESTAMPS/10,NO_OF_TIMESTAMPS,1)); %% GENERATE ARTIFICIAL TIMESTAMPS

B=unique(sort(round(randi([NO_OF_TIMESTAMPS/2,NO_OF_TIMESTAMPS*5],[NO_OF_TIMESTAMPS/10,1])/10))); %% GENERATE ARTIFICIAL LIST OF BAD TIMESTAMPS

B_LB=B-TOLERANCE; %% CREATE A LIST OF LOWERBOUND BAD TIMESTAMPS
B_UB=B+TOLERANCE; %% CREATE A LIST OF UPPERBPUND BAD TIMESTAMPS
B_RANGE=[B_LB B_UB]; %% AUGMENTED MATRIX COMPOSED OF VECTORS B_LB and B_UB

A_ROWS=size(A,1); %% SIZE OF A;

B_ROWS=size(B,1); %% SIZE OF B;

tic; %% START TIMER

A_TO_CLEAN=ones(A_ROWS,1); %% BOOLEAN VECTOR TO BE USED IN FILTERING
for ii=1:A_ROWS

    for jj=1:B_ROWS

        if A(ii)>=B_RANGE(jj,1) && A(ii)<=B_RANGE(jj,2) %% CHECK EACH MEMBER OF A VERSUS EACH MEMBER OF B_RANGE

           A_TO_CLEAN(ii)=0; %% SET INDEX VECTOR A_TO_CLEAN = 0 SO THAT WE CAN DELETE LATER

           break; %% A(ii) CAN ONLY BE ERASED ONCE, SO BREAK jj LOOP AND GO TO NEXT ii

        end

    end

end

CLEAN=A(~~A_TO_CLEAN); %% DELETE A VIA LOGICAL INDEXING

toc; %% END TIMER

clearvars -except A B_RANGE CLEAN %% ONLY SHOW RELEVANT VARIABLES

解决方案

The trick to making this efficient to to first sort both vectors. Then create a simple loop through one of the vectors, while maintaining an index into the second vector describing the closest element. That is, you will have something like

for ix1 = 1:length(timestamps)
    while (badTimes(ix2) < timestamps(ix1)
        ix2 = ix2+1;
    end
    %check timestamp(ix1) against badTimes(ix2), and maybe badTimes(ix2 + 1) and  badTimes(ix2 - 1)
end

Sorting is relatively efficient, especially using the built-ins. And now you only need a single loop.

This now bears a resemblance portions of the a merge sort algorithm.

这篇关于在MATLAB时间戳过滤器的优化 - 具有非常大的数据集工作的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆