How to compute similarity between two sentences (syntactical and semantical)

Question

I'm supposed to take two sentences at a time and decide whether they are similar. By similar I mean both syntactically and semantically.

INPUT1: Obama signs the law. A new law is signed by Obama.

INPUT2: A Bus is stopped here. A vehicle stops here.

INPUT3: Fire in NY. NY is burnt down.

INPUT4: Fire in NY. 50 died in NY fire.

I don't want to rely solely on an ontology tree. I wrote some code that computes the Levenshtein distance (LD) between the two sentences and then decides whether the 2nd sentence:

  • can be ignored (INPUT1 and INPUT2),
  • should replace the first sentence (INPUT3), or
  • should be stored along with the first sentence (INPUT4).

I'm not happy with the code, as LD only works at the syntactic level (what other methods are there?). How can semantics be incorporated (e.g. a bus is a sort of vehicle)?

Here is the code:

%# The edit distance between the stored event (string 1) and the new event
%# (string 2) decides whether the new event is ignored, replaces the stored
%# event, or is stored separately. The higher the LD, the more the two
%# strings differ: a low distance indicates identical or similar events,
%# a high distance indicates a fresh event.

%# Example inputs (INPUT3 above); replace with the events to compare.
str1='Fire in NY.';
str2='NY is burnt down.';

%#.........................................................................
%# Calculating the LD between two strings of events.
%#.........................................................................
L1=length(str1)+1;
L2=length(str2)+1;
L=zeros(L1,L2);   %# DP table: L(i,j) = LD between str1(1:i-1) and str2(1:j-1).

g=+1;             %# gap cost (insertion or deletion)
m=+0;             %# a match costs nothing; we seek to minimize
d=+1;             %# mismatch (substitution) cost

%# Boundary conditions: reducing a prefix to the empty string costs one gap per character.
L(:,1)=([0:L1-1]*g)';
L(1,:)=[0:L2-1]*g;

%# Calculating required edits.
for idx=2:L1
    for idy=2:L2
        if(str1(idx-1)==str2(idy-1))
            score=m;
        else
            score=d;
        end
        m1=L(idx-1,idy-1) + score;
        m2=L(idx-1,idy) + g;
        m3=L(idx,idy-1) + g;
        L(idx,idy)=min(m1,min(m2,m3)); %# keep the cheapest of the three edits.
    end
end
%# The LD between two strings.
D=L(L1,L2);

%#....................................................................
%# Making decision on what to do with the new event (string 2).
%#...................................................................
if (D<=4)     %# Distance is so small that string 2 seems identical to string 1.
    store=str1;        %# Hence string 2 is ignored. String 1 remains stored.
elseif (D>=5 && D<=15) %# Distance is too large for identical events but not
    %# large enough to make string 2 an individual event.
    store= str2;       %# String 2 is somewhat similar to string 1.
                       %# So, string 1 is replaced with string 2 and stored.
else
    %# For all other distances, string 2 is stored along with string 1.
    store={str1; str2};
end

Thanks for your help.

Recommended answer

"Semantically". There is no simple textbook algorithm for that. Natural language (esp. English) is a very complicated and fickle beast. Let's look at (just a small part of) the provided cases:

INPUT1: Obama signs the law. A new law is signed by Obama.

Signing a law makes it a 'new' law.
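
To make that concrete, here is a rough MATLAB sketch (not from the original posts) that compares the two INPUT1 sentences by word overlap (Jaccard index) rather than by character edits. The tokenizer and variable names are illustrative assumptions. Only "obama" and "law" are shared: "signs" vs. "signed" and the implied "new" are exactly the kind of knowledge a plain string metric does not have.

```matlab
% Word-overlap (Jaccard) similarity between the two INPUT1 sentences.
s1 = 'Obama signs the law.';
s2 = 'A new law is signed by Obama.';

% Assumed tokenizer: lowercase, keep runs of letters/digits, drop duplicates.
tokenize = @(s) unique(regexp(lower(s), '[a-z0-9]+', 'match'));

w1 = tokenize(s1);   % distinct words of sentence 1
w2 = tokenize(s2);   % distinct words of sentence 2

shared  = intersect(w1, w2);                     % words found in both sentences
jaccard = numel(shared) / numel(union(w1, w2));  % |A n B| / |A u B|

fprintf('shared words: %s\n', strjoin(shared(:)', ', '));
fprintf('Jaccard similarity: %.3f\n', jaccard);
```

Word order no longer matters here, so the active/passive swap is harmless, but the score stays low (2 shared words out of 9 distinct ones) even though a person would call the sentences near-identical.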

INPUT2: A Bus is stopped here. A vehicle stops here.

Need to know that a bus is a type of vehicle, as well as some sort of time relation. Also, what if the bus did stop but does not normally stop, or is no longer stopped? It can be taken several ways.
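
As a minimal sketch of what "knowing" the first part could look like, the MATLAB snippet below walks a tiny hand-written is-a table. The table entries and variable names are made up for illustration. A real system would need a far larger lexical resource, and this does nothing about the time relation.

```matlab
% Toy "is-a" table: each word maps to its direct parent category.
% The entries are illustrative assumptions, not taken from a real ontology.
isA = containers.Map( ...
    {'bus', 'car', 'vehicle', 'fire'}, ...
    {'vehicle', 'vehicle', 'object', 'event'});

word   = 'bus';
target = 'vehicle';

% Walk up the chain of parents until we reach the target or run out of entries.
current = word;
found   = strcmp(current, target);
while ~found && isKey(isA, current)
    current = isA(current);
    found   = strcmp(current, target);
end

if found
    fprintf('"%s" is a kind of "%s"\n', word, target);
else
    fprintf('no is-a path from "%s" to "%s"\n', word, target);
end
```

With a lookup like this, "bus" and "vehicle" could be treated as matching words before any string distance is computed, which is one way to fold a little semantics into the comparison.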

INPUT3: Fire in NY. NY is burnt down.

Need to know that fires can burn things down.

INPUT4: Fire in NY. 50 died in NY fire.

Need to know that fires can kill things (see next). Need to associate the "news headline" (50 WHAT?) with people. The brain can do this somewhat trivially. Computer programs are not brains.

And I'm no English major :-)
