How does similar_text work?


Question

I just found the similar_text function and was playing around with it, but the percentage output always surprises me. See the examples below.

I tried to find information on the algorithm used, as mentioned in the PHP docs for similar_text():

<?php
$p = 0;
similar_text('aaaaaaaaaa', 'aaaaa', $p);
echo $p . "<hr>";
//66.666666666667
//Since 5 out of 10 chars match, I would expect a 50% match

similar_text('aaaaaaaaaaaaaaaaaaaa', 'aaaaa', $p);
echo $p . "<hr>";
//40
//5 out of 20 > not 25% ?

similar_text('aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa', 'aaaaa', $p);
echo $p . "<hr>"; 
//9.5238095238095 
//5 out of 100 > not 5% ?


//Example from PHP.net
//Why is turning the strings around changing the result?

similar_text('PHP IS GREAT', 'WITH MYSQL', $p);
echo $p . "<hr>"; //27.272727272727

similar_text('WITH MYSQL', 'PHP IS GREAT', $p);
echo $p . "<hr>"; //18.181818181818

?>

Can anybody explain how this actually works?

Update:

Thanks to the comments I found that the percentage is actually calculated as the number of similar characters * 200 / (length1 + length2):

Z_DVAL_PP(percent) = sim * 200.0 / (t1_len + t2_len);

So that explains why the percentages are higher than expected. With a string where 5 out of 95 characters match it comes out as 10, so that is something I can use.

similar_text('aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa', 'aaaaa', $p);
echo $p . "<hr>"; 
//10
//5 out of 95 = 5 * 200 / (5 + 95) = 10
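The arithmetic behind those numbers can be checked with a minimal sketch of the quoted formula. The helper name similarityPercent is made up for this illustration and is not part of PHP or of any port; sim is the character count that similar_text returns.

```javascript
// Percentage as computed in the quoted C line:
//   percent = sim * 200.0 / (t1_len + t2_len)
// similarityPercent is a hypothetical helper name used only for this sketch.
function similarityPercent(sim, len1, len2) {
  // Mirror the guard in the C source: two empty strings give 0
  // instead of dividing by zero.
  if (len1 + len2 === 0) {
    return 0;
  }
  return (sim * 200) / (len1 + len2);
}

console.log(similarityPercent(5, 10, 5));  // 5 * 200 / 15  = 66.66...
console.log(similarityPercent(5, 20, 5));  // 5 * 200 / 25  = 40
console.log(similarityPercent(5, 100, 5)); // 5 * 200 / 105 = 9.52...
console.log(similarityPercent(5, 95, 5));  // 5 * 200 / 100 = 10
```

In other words, the percentage is the matched count measured against the average of the two lengths, not against the longer string.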

But I still can't figure out why PHP returns a different result when the strings are swapped. The JS code provided by dfsq doesn't do this. Looking at the PHP source code, the only difference I can find is in the following line, but I'm not a C programmer. Some insight into what the difference is would be appreciated.

In JS:

for (l = 0;(p + l < firstLength) && (q + l < secondLength) && (first.charAt(p + l) === second.charAt(q + l)); l++);

In PHP: (php_similar_str function)

for (l = 0; (p + l < end1) && (q + l < end2) && (p[l] == q[l]); l++);
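As far as the two lines themselves go, they appear to do the same job. In the C version p and q are pointers into the strings, so p[l] reads the character l positions past p, which is what charAt(p + l) does with integer indexes. A minimal sketch of that inner loop as a standalone function (matchLength is a hypothetical name chosen here, not something from PHP or the port):

```javascript
// Length of the run of equal characters that starts at first[p] and second[q].
// This is what the one-line `for` loop with an empty body computes into `l`.
function matchLength(first, p, second, q) {
  let l = 0;
  while (
    p + l < first.length &&
    q + l < second.length &&
    first.charAt(p + l) === second.charAt(q + l)
  ) {
    l++;
  }
  return l;
}

// 'H' matches 'H', then ' ' is compared with 'P' and the run stops.
console.log(matchLength('WITH MYSQL', 3, 'PHP IS GREAT', 1)); // 1
// 'bcd' matches, then 'e' is compared with 'y'.
console.log(matchLength('abcde', 1, 'xbcdy', 1)); // 3
```

So any difference in behaviour has to come from the code around this loop rather than from the loop itself.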

Source:

/* {{{ proto int similar_text(string str1, string str2 [, float percent])
   Calculates the similarity between two strings */
PHP_FUNCTION(similar_text)
{
  char *t1, *t2;
  zval **percent = NULL;
  int ac = ZEND_NUM_ARGS();
  int sim;
  int t1_len, t2_len;

  if (zend_parse_parameters(ZEND_NUM_ARGS() TSRMLS_CC, "ss|Z", &t1, &t1_len, &t2, &t2_len, &percent) == FAILURE) {
    return;
  }

  if (ac > 2) {
    convert_to_double_ex(percent);
  }

  if (t1_len + t2_len == 0) {
    if (ac > 2) {
      Z_DVAL_PP(percent) = 0;
    }

    RETURN_LONG(0);
  }

  sim = php_similar_char(t1, t1_len, t2, t2_len);

  if (ac > 2) {
    Z_DVAL_PP(percent) = sim * 200.0 / (t1_len + t2_len);
  }

  RETURN_LONG(sim);
}
/* }}} */ 


/* {{{ php_similar_str
 */
static void php_similar_str(const char *txt1, int len1, const char *txt2, int len2, int *pos1, int *pos2, int *max)
{
  char *p, *q;
  char *end1 = (char *) txt1 + len1;
  char *end2 = (char *) txt2 + len2;
  int l;

  *max = 0;
  for (p = (char *) txt1; p < end1; p++) {
    for (q = (char *) txt2; q < end2; q++) {
      for (l = 0; (p + l < end1) && (q + l < end2) && (p[l] == q[l]); l++);
      if (l > *max) {
        *max = l;
        *pos1 = p - txt1;
        *pos2 = q - txt2;
      }
    }
  }
}
/* }}} */


/* {{{ php_similar_char
 */
static int php_similar_char(const char *txt1, int len1, const char *txt2, int len2)
{
  int sum;
  int pos1, pos2, max;

  php_similar_str(txt1, len1, txt2, len2, &pos1, &pos2, &max);

  if ((sum = max)) {
    if (pos1 && pos2) {
      sum += php_similar_char(txt1, pos1,
                  txt2, pos2);
    }
    if ((pos1 + max < len1) && (pos2 + max < len2)) {
      sum += php_similar_char(txt1 + pos1 + max, len1 - pos1 - max,
                  txt2 + pos2 + max, len2 - pos2 - max);
    }
  }

  return sum;
}
/* }}} */

Source in JavaScript: similar_text port to JavaScript

Solution

It would indeed seem that the function uses different logic depending on the parameter order. I think there are two things at play.

First, see this example:

echo similar_text('test','wert'); // 1
echo similar_text('wert','test'); // 2

It seems to be testing "how many times any distinct char in param1 is found in param2", and thus the result differs if you swap the params around. It has been reported as a bug, which hasn't been confirmed by anyone.

Now, the above is the same for both the PHP and JavaScript implementations: parameter order has an impact, so saying that the JS code wouldn't do this is wrong. I think it is possible to argue that this is intended behaviour. Not sure if it is.
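To try the 'test' / 'wert' case outside PHP, here is a rough line-by-line transliteration of the two C functions quoted in the question. It is only a sketch: the names similarStr and similarChar are made up for this example, it is not the port mentioned in the question, and it assumes plain ASCII input (PHP compares bytes, JavaScript compares UTF-16 code units).

```javascript
// Mirrors php_similar_str: find the FIRST longest common substring,
// scanning the first string in the outer loop.
function similarStr(s1, s2) {
  let max = 0;
  let pos1 = 0;
  let pos2 = 0;
  for (let p = 0; p < s1.length; p++) {
    for (let q = 0; q < s2.length; q++) {
      let l = 0;
      while (p + l < s1.length && q + l < s2.length && s1[p + l] === s2[q + l]) {
        l++;
      }
      // Strictly greater: on a tie the earlier match is kept.
      if (l > max) {
        max = l;
        pos1 = p;
        pos2 = q;
      }
    }
  }
  return { max, pos1, pos2 };
}

// Mirrors php_similar_char: count the match, then recurse on the text
// to the left of both match positions and to the right of both.
function similarChar(s1, s2) {
  const { max, pos1, pos2 } = similarStr(s1, s2);
  if (max === 0) {
    return 0;
  }
  let sum = max;
  if (pos1 > 0 && pos2 > 0) {
    sum += similarChar(s1.slice(0, pos1), s2.slice(0, pos2));
  }
  if (pos1 + max < s1.length && pos2 + max < s2.length) {
    sum += similarChar(s1.slice(pos1 + max), s2.slice(pos2 + max));
  }
  return sum;
}

console.log(similarChar('test', 'wert')); // 1
console.log(similarChar('wert', 'test')); // 2
```

With 'test' first, the scan hits 't' at the very start of 'test' and at the very end of 'wert', so nothing is left on either side to compare. With 'wert' first, the scan hits 'e' first, which leaves 'rt' and 'st' on the right and picks up the 't' as well.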

Second, what doesn't seem correct is the MYSQL/PHP example. There, the JavaScript version gives 3 regardless of the order of the params, whereas PHP gives 2 and 3 (and because of that, the percentage differs as well). Now, the phrases "PHP IS GREAT" and "WITH MYSQL" should have 5 characters in common regardless of which way you compare: H, I, S and T, one each, plus one for the space. In order, they have 3 characters in common, 'H', ' ' and 'S', so if you take ordering into account, the correct answer should be 3 both ways. I modified the C code into a runnable version and added some output, so one can see what is happening there (codepad link):

#include<stdio.h>

/* {{{ php_similar_str
 */
static void php_similar_str(const char *txt1, int len1, const char *txt2, int len2, int *pos1, int *pos2, int *max)
{
  char *p, *q;
  char *end1 = (char *) txt1 + len1;
  char *end2 = (char *) txt2 + len2;
  int l;

  *max = 0;
  for (p = (char *) txt1; p < end1; p++) {
    for (q = (char *) txt2; q < end2; q++) {
      for (l = 0; (p + l < end1) && (q + l < end2) && (p[l] == q[l]); l++);
      if (l > *max) {
        *max = l;
        *pos1 = p - txt1;
        *pos2 = q - txt2;
      }
    }
  }
}
/* }}} */


/* {{{ php_similar_char
 */
static int php_similar_char(const char *txt1, int len1, const char *txt2, int len2)
{
  int sum;
  int pos1, pos2, max;

  php_similar_str(txt1, len1, txt2, len2, &pos1, &pos2, &max);

  if ((sum = max)) {
    if (pos1 && pos2) {
      printf("txt here %s,%s\n", txt1, txt2);
      sum += php_similar_char(txt1, pos1,
                  txt2, pos2);
    }
    if ((pos1 + max < len1) && (pos2 + max < len2)) {
      printf("txt here %s,%s\n", txt1+ pos1 + max, txt2+ pos2 + max);
      sum += php_similar_char(txt1 + pos1 + max, len1 - pos1 - max,
                  txt2 + pos2 + max, len2 - pos2 - max);
    }
  }

  return sum;
}
/* }}} */
int main(void)
{
    printf("Found %d similar chars\n",
        php_similar_char("PHP IS GREAT", 12, "WITH MYSQL", 10));
    printf("Found %d similar chars\n",
        php_similar_char("WITH MYSQL", 10,"PHP IS GREAT", 12));
    return 0;
}

The output is:

txt here PHP IS GREAT,WITH MYSQL
txt here P IS GREAT, MYSQL
txt here IS GREAT,MYSQL
txt here IS GREAT,MYSQL
txt here  GREAT,QL
Found 3 similar chars
txt here WITH MYSQL,PHP IS GREAT
txt here TH MYSQL,S GREAT
Found 2 similar chars

So one can see that on the first comparison, the function found 'H', ' ' and 'S', but not 'T', and got the result of 3. The second comparison found 'I' and 'T' but not 'H', ' ' or 'S', and thus got the result of 2.

The reason for these results can be seen from the output: the algorithm takes the first letter in the first string that the second string contains, counts that, and throws away the chars before that from the second string. That is why it misses the characters in between, and that is what causes the difference when you change the order.

What happens there might be intentional or it might not. However, that's not how the JavaScript version works. If you print out the same things in the JavaScript version, you get this:
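The same effect can be made visible by recording which characters get picked at each step. The sketch below follows the logic of the C code above but collects the matches instead of summing them; collectMatches and its off1/off2/out parameters are names invented for this illustration, and plain ASCII input is assumed.

```javascript
// Same recursion as php_similar_char, but every chosen match is recorded
// together with its position in the ORIGINAL strings.
function collectMatches(s1, s2, off1 = 0, off2 = 0, out = []) {
  let max = 0;
  let pos1 = 0;
  let pos2 = 0;
  for (let p = 0; p < s1.length; p++) {
    for (let q = 0; q < s2.length; q++) {
      let l = 0;
      while (p + l < s1.length && q + l < s2.length && s1[p + l] === s2[q + l]) {
        l++;
      }
      if (l > max) {
        max = l;
        pos1 = p;
        pos2 = q;
      }
    }
  }
  if (max > 0) {
    out.push({ text: s1.slice(pos1, pos1 + max), at1: off1 + pos1, at2: off2 + pos2 });
    // Only text to the left of BOTH match positions is compared...
    if (pos1 > 0 && pos2 > 0) {
      collectMatches(s1.slice(0, pos1), s2.slice(0, pos2), off1, off2, out);
    }
    // ...and only text to the right of BOTH match positions.
    if (pos1 + max < s1.length && pos2 + max < s2.length) {
      collectMatches(
        s1.slice(pos1 + max), s2.slice(pos2 + max),
        off1 + pos1 + max, off2 + pos2 + max, out
      );
    }
  }
  return out;
}

const texts = (a, b) => collectMatches(a, b).map((m) => m.text);

console.log(texts('PHP IS GREAT', 'WITH MYSQL')); // [ 'H', ' ', 'S' ]
console.log(texts('WITH MYSQL', 'PHP IS GREAT')); // [ 'I', 'T' ]
```

In the second call the 'T' of "WITH" is matched against the final 'T' of "GREAT". Nothing is left to the right of that 'T' in "PHP IS GREAT", so 'H', ' ' and 'S' never get a chance to match. Which match becomes the pivot depends on the scan order over the first argument, and that is where the asymmetry comes from.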

txt here: PHP, WIT
txt here: P IS GREAT,  MYSQL
txt here: IS GREAT, MYSQL
txt here: IS, MY
txt here:  GREAT, QL
Found 3 similar chars
txt here: WITH, PHP 
txt here: W, P
txt here: TH MYSQL, S GREAT
Found 3 similar chars

This shows that the JavaScript version does it in a different way. What the JavaScript version does is find 'H', ' ' and 'S' in the same order in the first comparison, and the same 'H', ' ' and 'S' in the second one, so in this case the order of the params doesn't matter.

I'd argue that the JavaScript version is the more correct way of doing it, but that's up for speculation. In any case, as the JavaScript is meant to duplicate the code of the PHP function, it needs to behave identically, which is why I submitted a bug report based on the analysis by @Khez, and the fix. Kudos there.
