字母顺序排序列表中的字符串匹配(MATLAB方法) [英] string matching in an alpbetically ordered list (the MATLAB way)

查看:551
本文介绍了字母顺序排序列表中的字符串匹配(MATLAB方法)的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个大的按字母顺序排列的字符串单元格数组(约495,000),其中有很多重复项(由于字母顺序,所以彼此相邻).

I have a large alphabetically ordered cell array of strings (~495 thousand), with lots of duplicates (which are next to each other, because it's alphabetical).

对于给定的查找字符串,我需要在列表中找到与我传入的字符串匹配的所有字符串.

For a given look-up string, I need to find all the strings in the list which will match the one I pass in.

我一直在使用strcmp(lookUpString,list)来执行此操作,但这非常慢-我认为它正在遍历列表中的每个值进行比较,因为它不知道它是按字母顺序排序的.

I've been using strcmp(lookUpString,list) to do this, but this is extremely slow-- I think it's going through each value in the list to compare, because it doesn't know it's alphabetically sorted.

我可以编写一个while循环来遍历列表,以使用strcmp比较每个字符串,直到找到所需的字符串块(然后停止),但是我想知道是否存在"matlab"方法执行此操作(即对排序后的数组执行逻辑比较操作).

I could write a while loop to iterate through the list to compare each string using strcmp until I find the block of strings I want (and then stop), but I was wondering if there was a "matlab" way of doing this (i.e. performing logical comparison operations on a sorted array).

感谢您的帮助!

推荐答案

更新:我对之前的方法3"不满意,因此我对其进行了一些重新调整更好的性能.现在,它的运行速度比单纯的strcmp快10倍.

UPDATE: I was not satisfied with my earlier "Method 3" so I've just re-jigged it a little to get better performance. It now runs almost 10 times faster than a naive strcmp.

strcmp在我的机器上获胜(Linux Mint 12上的2011b).特别是,它的效果比ismember好得多.但是,如果您自己进行一些手动预分类,则可以稍微提高速度.考虑以下速度测试:

strcmp wins on my machine (2011b on Linux Mint 12). In particular, it works much better than ismember. However, you can gain a bit of an extra speed up if you do some manual presorting yourself. Consider the following speed test:

NumIter = 100;
N = 495000;
K = N / 20;
List = cell(N, 1);
for i = 1:20
    List(i*K - K + 1:i*K) = cellstr(char(i+96));
end

StrToFind = cell(NumIter, 1);
for j = 1:NumIter
    StrToFind{j} = char(round(rand * 20) + 96);
end

%# METHOD 1 (ismember)
tic
for j = 1:NumIter
    Index1 = ismember(List, StrToFind{j});
    Soln1 = List(Index1);
end
toc

%#METHOD 2 (strcmp)
tic
for j = 1:NumIter
    Index2 = strcmp(StrToFind{j}, List);
    Soln2 = List(Index2);
end
toc

%#METHOD 3 (strmp WITH MANUAL PRE-SORTING)    
tic
for j = 1:NumIter
    CurStrToFind = StrToFind{j};
    K = 100;
    I1 = zeros(K, 2); I1(1, :) = ones(1, 2);
    I2 = zeros(K, 2); I2(end, 1) = 1; I2(end, 2) = N;
    KDiv = floor(N/K);
    for k = 2:K-1
        CurSearchNum = k * KDiv;
        CurListItem = List{CurSearchNum};
        if CurListItem < CurStrToFind; I1(k, 1) = 1; end;
        if CurListItem > CurStrToFind; I2(k, 1) = 1; end;
        I1(k, 2) = CurSearchNum; I2(k, 2) = CurSearchNum;
    end
    a = find(I1(:, 1), 1, 'last');
    b = find(I2(:, 1), 1, 'first');
    ShortList = List(I1(a, 2):I2(b, 2));
    Index3 = strcmp(CurStrToFind, ShortList);
    Soln3 = ShortList(Index3);
end
toc

输出为:

Elapsed time is 6.411537 seconds.
Elapsed time is 1.396239 seconds.
Elapsed time is 0.150143 seconds.

这篇关于字母顺序排序列表中的字符串匹配(MATLAB方法)的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆