Clustering text in MATLAB


Problem description

I want to do hierarchical agglomerative clustering on texts in MATLAB. Say I have four sentences:

I have a pen.
I have a paper. 
I have a pencil.
I have a cat. 

I want to cluster the four sentences above to see which are more similar. I know the Statistics Toolbox has commands such as pdist to measure pairwise distances, linkage to calculate the cluster similarity, and so on. Simple code like:

X=[1 2; 2 3; 1 4];
Y=pdist(X, 'euclidean');
Z=linkage(Y, 'single');
H=dendrogram(Z)

works fine and returns a dendrogram.

I wonder whether I can use these commands on texts like the ones above. Any thoughts?


UPDATES:

Thanks to Amro. I read up on it, understood it, and computed the distances among the strings. The code follows:

clc
S1='I have a pen'; % first String

f_id=fopen('events.txt','r'); %saved strings to compare with
events=textscan(f_id, '%s', 'Delimiter', '\n');
fclose(f_id); %close file.
events=events{1}; % saving the text read.

ii=numel(events); % number of stored strings
% store the texts in a cell array

for kk=1:ii

   S2=events(kk);
   S2=cell2mat(S2);
   Z=levenshtein_distance(S1,S2);
   X(kk)=Z;

end 

I input one string and had 4 saved strings. I then calculated the distances using the levenshtein_distance function. It returns a matrix X=[ 17 0 16 18 16].

I guess this is my pairwise distance matrix, similar to what pdist returns. Is it?

Now I'm trying to pass X to linkage, like this:

Z=linkage(X, 'single');

Output I'm getting is:

Error using ==> linkage at 93 Size of Y not compatible with the output of the PDIST function.

Error in ==> Untitled2 at 20 Z=linkage(X,'single') .

Why is that? Can I use the linkage function at all? Help appreciated.

UPDATE 2

clc
S1='I have a pen';

f_id=fopen('events.txt','r');
events=textscan(f_id, '%s', 'Delimiter', '\n');
fclose(f_id); %close file.
events=events{1}; % saving the text read.

ii=numel(events)+1; % total number of strings in the comparison

D=zeros(ii, ii); % initialized distance matrix;
for kk=1:ii 

    S2=events(kk);

    %S2=cell2mat(S2);

    for jk=kk+1:ii

  D(kk,jk)= levenshtein_distance(S1{kk},S2{jk});

    end

end

D = D + D';       %'# symmetric distance matrix

%# linkage expects the output format to match that of pdist,
%# so we convert D to a row vector (lower/upper part of matrix)
D = squareform(D, 'tovector');

T = linkage(D, 'single');
dendrogram(T)

Error: ??? Cell contents reference from a non-cell array object.

Error in ==> Untitled2 at 22 D(kk,jk)= levenshtein_distance(S1{kk},S2{jk});

Also, why am I reading the events from the file inside the first loop? That doesn't seem logical. I'm a bit confused about whether I can work this way, or whether the only solution is to enter all the strings inside the code. Help much appreciated.

UPDATE

Code to compare two sentences:

clc
    str1 = 'Fire in NY';
    str2= 'Jeff is sick';

D=levenshtein_distance(str1,str2);
D = D + D';       %'# symmetric distance matrix

%# linkage expects the output format to match that of pdist,
%# so we convert D to a row vector (lower/upper part of matrix)

%D = squareform(D, 'tovector');

T = linkage(D, 'complete');
[H,P] = dendrogram(T,'colorthreshold','default');  

Output: D=18.

With different strings:

clc
str1 = 'Fire in NY';
str2= 'NY catches fire';

D=levenshtein_distance(str1,str2);
D = D + D';       %'# symmetric distance matrix

%# linkage expects the output format to match that of pdist,
%# so we convert D to a row vector (lower/upper part of matrix)

%D = squareform(D, 'tovector');

T = linkage(D, 'complete');
[H,P] = dendrogram(T,'colorthreshold','default'); 

Output: D=28.

Based on the distance, a completely different sentence looks more similar. What I'm trying to do is this: if I have stored 'Fire in NY', I won't store 'NY catches fire'. In the first case, however, I would store the sentence because the information is new.

Is LD (Levenshtein distance) sufficient to do this? Help appreciated.

Solution

What you need is a distance function that can handle strings. Check out the Levenshtein distance (edit distance). There are plenty of implementations out there.
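
For reference, here is a minimal dynamic-programming sketch of the edit distance. The name levenshtein_distance is used only so that it matches the calls in this thread; it is an illustration, not one of the existing implementations mentioned above, and it assumes both inputs are character vectors and that the function is saved as levenshtein_distance.m on the MATLAB path.

```matlab
function d = levenshtein_distance(s, t)
%LEVENSHTEIN_DISTANCE Edit distance between two character vectors.
%   Illustrative sketch only (name chosen to match the calls in this thread).
    m = numel(s);
    n = numel(t);

    % D(i+1, j+1) holds the distance between s(1:i) and t(1:j)
    D = zeros(m + 1, n + 1);
    D(:, 1) = (0:m)';   % turning s(1:i) into '' takes i deletions
    D(1, :) = 0:n;      % turning '' into t(1:j) takes j insertions

    for i = 1:m
        for j = 1:n
            cost = double(s(i) ~= t(j));            % 0 if the characters match
            D(i+1, j+1) = min([D(i, j+1) + 1, ...   % deletion
                               D(i+1, j) + 1, ...   % insertion
                               D(i, j) + cost]);    % substitution or match
        end
    end
    d = D(m + 1, n + 1);
end
```

With this sketch, levenshtein_distance('kitten', 'sitting') gives 3, and 'I have a pen.' versus 'I have a cat.' also gives 3 (three substituted characters).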

Alternatively, you could extract some interesting features (e.g. number of vowels, length of the string, etc.) to build a vector space representation; you can then apply any of the usual distance measures (Euclidean, ...) to the new representation.
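
As a rough sketch of the feature-vector route, the snippet below uses word counts as the features. That choice is an assumption made for illustration (the features named above were vowel counts and string length), as are the simple letters-only tokenization and the cosine metric. The three sentences are the ones from the question.

```matlab
% Sentences taken from the question
str = {'Fire in NY'
       'NY catches fire'
       'Jeff is sick'};
numStr = numel(str);

% Tokenize: lowercase, then extract runs of letters (assumed tokenization)
tokens = cell(numStr, 1);
for i = 1:numStr
    tokens{i} = regexp(lower(str{i}), '[a-z]+', 'match');
end

% Vocabulary = all distinct words across the sentences
vocab = unique([tokens{:}]);

% X(i,k) counts how often word k occurs in sentence i
X = zeros(numStr, numel(vocab));
for i = 1:numStr
    [~, idx] = ismember(tokens{i}, vocab);
    X(i, :) = accumarray(idx(:), 1, [numel(vocab) 1])';
end

% Ordinary numeric data now, so pdist can be used directly
Y = pdist(X, 'cosine');      % order: 1-vs-2, 1-vs-3, 2-vs-3
Z = linkage(Y, 'average');
dendrogram(Z, 'Labels', str);
```

Here the first two sentences share two of their three words, so their cosine distance is 1/3, while 'Jeff is sick' shares no words with either and sits at distance 1. Unlike a character-level edit distance, this representation ignores word order, which may be closer to what the question's last update is after.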


EDIT

The problem with your code is that LINKAGE expects the input distances to be in the same format as the output of PDIST, namely a row vector corresponding to pairs of observations in the order 1-vs-2, 1-vs-3, 2-vs-3, etc., which is basically the lower half of the complete distance matrix (since it is supposed to be symmetric, i.e. dist(1,2) == dist(2,1)).

%# instances
str = {'I have a pen.'
    'I have a paper.'
    'I have a pencil.'
    'I have a cat.'};
numStr = numel(str);

%# create and fill upper half only of distance matrix
D = zeros(numStr,numStr);
for i=1:numStr
    for j=i+1:numStr
        D(i,j) = levenshtein_distance(str{i},str{j});
    end
end
D = D + D';       %'# symmetric distance matrix

%# linkage expects the output format to match that of pdist,
%# so we convert D to a row vector (lower/upper part of matrix)
D = squareform(D, 'tovector');

T = linkage(D, 'single');
dendrogram(T)
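
The vector format that linkage expects can be checked on the small numeric example from the question. This is only a sanity-check sketch; the variable names Dfull and Yback are made up for the illustration.

```matlab
X = [1 2; 2 3; 1 4];          % 3 observations, as in the question
Y = pdist(X, 'euclidean');    % 1-by-3 row vector: d(1,2), d(1,3), d(2,3)
n = size(X, 1);

% pdist returns one value per unordered pair, i.e. n*(n-1)/2 values
assert(numel(Y) == n*(n-1)/2);

% squareform converts between the vector and the full symmetric matrix
Dfull = squareform(Y);                    % 3-by-3, zeros on the diagonal
Yback = squareform(Dfull, 'tovector');    % back to the pdist-style vector
assert(isequal(Y, Yback));
```

A valid input vector therefore has 3, 6, 10, ... elements (for 3, 4, 5, ... observations). The 5-element X from the first update cannot be such a vector, and it only holds the distances from one string to each stored string rather than all pairs, which is why linkage rejected it.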

Please refer to the documentation of the functions in question for more information...
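
If the strings live in a file, as in the second update, the same approach works once everything is in a single cell array. The sketch below assumes events.txt holds one string per line and that a levenshtein_distance function is on the path, as used throughout this thread. In the second update S1 is a plain character array, so S1{kk} cannot be indexed with braces; collecting the new string and the stored ones into one cell array avoids that.

```matlab
S1 = 'I have a pen';                 % new string, as in the question

% Read the stored strings, one per line (file name taken from the question)
f_id = fopen('events.txt', 'r');
events = textscan(f_id, '%s', 'Delimiter', '\n');
fclose(f_id);
events = events{1};                  % column cell array of strings

% One cell array for all strings, so both loops index the same list
str = [{S1}; events];
numStr = numel(str);

% Fill the upper half of the distance matrix, then mirror it
D = zeros(numStr, numStr);
for i = 1:numStr
    for j = i+1:numStr
        D(i, j) = levenshtein_distance(str{i}, str{j});
    end
end
D = D + D';

% Convert to the pdist-style vector before calling linkage
T = linkage(squareform(D, 'tovector'), 'single');
dendrogram(T)
```

The file is read once, before the loops, and the loops only touch the cell array.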
