Word search algorithm using an m.file


Problem description

I have already implemented my algorithm in MATLAB using cell arrays of strings, but I can't seem to make it work when reading from a file.

In MATLAB, I create a cell array of strings for each line; let's call it line.

So I get

     line= 'string1' 'string2' etc
     line= 'string 5' 'string7'...
     line=...

and so on. I have hundreds of lines to read.

What I'm trying to do is compare the words in the first line to itself. Then I combine the first and second lines and compare the words in the second line to the combined cell. I accumulate each cell I read and compare it with the last cell read.

Here is my code, where the lines are a, b, c, d, ...:

for i = 1:length(a)
    for j = 1:length(a)
        AA = ismember(a, a)
    end
end

combine = [a, b]
[unC, i] = unique(combine, 'first')
sorted = combine(sort(i))

for i = 1:length(sorted)
    for j = 1:length(b)
        AB = ismember(sorted, b)
    end
end

combine1 = [a, b, c]

..... When I read my file, I use a while loop that reads the whole file until the end, so how can I implement my algorithm if all my cell arrays of strings have the same name?

    while ~feof(fid)
        out = fgetl(fid)
        % skip empty lines, comment lines and the end-of-file marker
        if isempty(out) || strncmp(out, '%', 1) || ~ischar(out)
            continue
        end
        % split the line just read (out) into words
        line = regexp(out, ' ', 'split')
    end

Solution

Suppose your data file is called data.txt and its content is:

string1 string2 string3 string4
string2 string3 
string4 string5 string6

A very easy way to keep a single occurrence of each word is:

% Parse everything in one go
fid = fopen('C:\Users\ok1011\Desktop\data.txt');
out = textscan(fid,'%s');
fclose(fid);

unique(out{1})
ans = 
    'string1'
    'string2'
    'string3'
    'string4'
    'string5'
    'string6'
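
One caveat worth illustrating: unique returns its result sorted, which happens to match the order of first appearance for the sample data above but will not in general. A minimal sketch, assuming R2012a or later (where the 'stable' option is available) and using a made-up word list rather than the contents of data.txt:

```matlab
% Hypothetical words in reading order (not taken from data.txt)
words = {'string4', 'string1', 'string4', 'string2', 'string1'};

sortedWords = unique(words);            % alphabetical order
stableWords = unique(words, 'stable');  % order of first appearance

disp(sortedWords)   % string1, string2, string4
disp(stableWords)   % string4, string1, string2
```

If the order in which words first show up matters for your comparison, the 'stable' form is probably the one you want.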

This approach might not work if:

  • your data file has irregularities
  • you actually need the comparison indices
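
As for irregularities, one common case is extra whitespace (leading, trailing or repeated spaces), which makes a plain split on ' ' produce empty strings. A small sketch of one way to guard against that; rawLine is a made-up example, and splitting on '\s+' is my choice here, not something the question requires:

```matlab
% Hypothetical irregular line: leading, repeated and trailing spaces
rawLine = '  string2   string3 ';

% Trim the ends, then split on runs of whitespace
words = regexp(strtrim(rawLine), '\s+', 'split');

% Drop any empty strings (e.g. when the line was blank)
words = words(~cellfun('isempty', words));

disp(words)   % string2, string3
```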

EDIT: solution for performance

% Parse in bulk and split (assuming you don't know the maximum
% number of strings in a line; otherwise you can use textscan alone)

fid = fopen('C:\Users\ok1011\Desktop\data.txt');
out = textscan(fid,'%s','Delimiter','\n');
out = regexp(out{1},' ','split');
fclose(fid);

% Collect the unique words across all lines
comb = unique([out{:}]); % you might need to remove empty strings from here

% preallocate idx
m   = size(out,1);
idx = false(m,size(comb,2));

% Loop for number of lines (rows)
for ii = 1:m
    idx(ii,:) = ismember(comb,out{ii});
end

Note that the resulting idx is:

idx =
     1     1     1     1     0     0
     0     1     1     0     0     0
     0     0     0     1     1     1
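
If what you actually need is the running comparison from the question (each line against everything read before it), you can keep every line in one cell array indexed by line number instead of reusing a single variable name. A rough sketch under that assumption; lines, seen and isNew are names I made up, and the data is typed in by hand to mirror data.txt so the snippet runs without the file ('stable' needs R2012a or later):

```matlab
% One inner cell array per line of the file
lines = { {'string1', 'string2', 'string3', 'string4'}, ...
          {'string2', 'string3'}, ...
          {'string4', 'string5', 'string6'} };

seen  = {};                 % words accumulated so far, in first-seen order
isNew = cell(size(lines));  % isNew{k}(j) is true if word j of line k is new

for k = 1:numel(lines)
    current  = lines{k};
    isNew{k} = ~ismember(current, seen);       % compare with earlier lines
    seen     = unique([seen, current], 'stable');
end

% isNew{1} is [1 1 1 1], isNew{2} is [0 0], isNew{3} is [0 1 1]
```

In the file-reading loop you would fill lines{k} with the result of the split for the k-th line, which sidesteps the "same name" problem.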

The advantage of keeping it in this form is that you save space compared with a cell array (which imposes 112 bytes of overhead per cell). You can also store it as a sparse array, which may reduce storage costs further.

Another thing to note is that a logical index may be longer than the array it indexes (e.g. a double array): as long as the excess elements are all false, you can still use it (and by construction of the problem above, idx satisfies this requirement). An example to clarify:
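
To check whether the sparse form pays off for your data, you can compare the two directly. A quick sketch using the idx from above, typed in by hand; idxSparse is just an illustrative name. For a matrix this small the sparse version is actually larger, so the gain only shows up when there are many distinct words and each line contains few of them:

```matlab
idx = logical([1 1 1 1 0 0; ...
               0 1 1 0 0 0; ...
               0 0 0 1 1 1]);

idxSparse = sparse(idx);   % sparse logical matrix with the same content

whos idx idxSparse         % compare the Bytes column

% The sparse copy indexes the same way as the full one
sameContent = isequal(full(idxSparse), idx);
```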

A = 1:3;
A([true false true false false])
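
To round this off, here is a short sketch of how idx can be read back once it is built. comb and idx are typed in by hand to match the sample data; wordsInLine2 and linesWithString4 are names chosen only for this illustration:

```matlab
comb = {'string1', 'string2', 'string3', 'string4', 'string5', 'string6'};
idx  = logical([1 1 1 1 0 0; ...
                0 1 1 0 0 0; ...
                0 0 0 1 1 1]);

% Words that occur in line 2
wordsInLine2 = comb(idx(2, :));       % {'string2', 'string3'}

% Lines that contain the 4th word ('string4')
linesWithString4 = find(idx(:, 4));   % [1; 3]

% A logical index longer than the indexed array is accepted
% as long as the extra trailing elements are all false
A = 1:3;
B = A([true false true false false]); % B is [1 3]
```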
