在表中查找相似的数字模式 [英] Finding similar number patterns in table

查看:110
本文介绍了在表中查找相似的数字模式的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

好的,假设我们有members表.假设有一个名为about_member的字段.每个人都会有这样的字符串1-1-2-1-2.假设member_1拥有此字符串1-1-2-2-1,并且他搜索谁拥有相似的字符串或尽可能相似的字符串.例如,如果member_2的字符串为1-1-2-2-1,则将100%匹配,但是如果member_3的字符串为2-1-1-2-1,则将为60%匹配.而且必须按匹配百分比排序.使用MYSQL和PHP的最佳方法是什么?真的很难解释我的意思,但也许您明白了,如果没有,问我.谢谢.

Ok, let's suppose we have members table. There is a field called, let's say, about_member. There will be a string like this 1-1-2-1-2 for everybody. Let's suppose member_1 has this string 1-1-2-2-1 and he searches who has the similar string or as much similar as possible. For example if member_2 has string 1-1-2-2-1 it will be 100% match, but if member_3 has string like this 2-1-1-2-1 it will be 60% match. And it has to be ordered by match percent. What is the most optimal way to do it with MYSQL and PHP? It's really hard to explain what I mean, but maybe you got it, if not, ask me. Thanks.

请给我一些没有Levenshtein方法的想法.该答案将得到赏金.谢谢. (我们将在有机会时宣布赏金)

推荐答案

Jawa最初发布了这个想法;这是我的尝试.

Jawa posted this idea originally; here is my attempt.

^是XOR函数.它逐位比较2个二进制数,如果两位相同则返回0,否则返回1.

^ is the XOR function. It compares 2 binary numbers bit-by-bit and returns 0 if both bits are the same, and 1 otherwise.

    0 1 0 0 0 1 0 1 0 1 1 1  (number 1)
 ^  0 1 1 1 0 1 0 1 1 0 1 1  (number 2)
 =  0 0 1 1 0 0 0 0 1 1 0 0  (result)

这如何解决您的问题:

  // In binary...
  1111 ^ 0111 = 1000 // (1 bit out of 4 didn't match: 75% match)
  1111 ^ 0000 = 1111 // (4 bits out of 4 didn't match: 0% match)

  // The same examples, except now in decimal...
    15 ^    7 = 8  (1000 in binary) // (1 bit out of 4 didn't match: 75% match)
    15 ^    0 = 15 (1111 in binary) // (4 bits out of 4 didn't match: 0% match)

我们如何在MySQL中计算这些位数:

How we can count these bits in MySQL:

BIT_COUNT(b'0111') = 3 // Bit count of binary '0111'
BIT_COUNT(7) = 3       // Bit count of decimal 7 (= 0111 in binary)
BIT_COUNT(b'1111' ^ b'0111') = 1 // (1 bit out of 4 didn't match: 75% match)

因此要获得相似性 ...

// First we focus on calculating mismatch.
(BIT_COUNT(b'1111' ^ b'0111') / YOUR_TOTAL_BITS) = 0.25 (25% mismatch)
(BIT_COUNT(b'1111' ^ b'1111') / YOUR_TOTAL_BITS) = 0 (0% mismatch; 100% match)

// Now, getting the proportion of matched bits is easy
1 - (BIT_COUNT(b'1111' ^ b'0111') / YOUR_TOTAL_BITS) = 0.75 (75% match)
1 - (BIT_COUNT(b'1111' ^ b'1111') / YOUR_TOTAL_BITS) = 1.00 (100% match)

如果我们仅使您的about_member字段将数据存储为位(并由整数表示),那么我们可以轻松地完成所有这些操作!代替1-2-1-1-1,使用0-1-0-0-0,但不要使用破折号.

If we could just make your about_member field store data as bits (and be represented by an integer), we could do all of this easily! Instead of 1-2-1-1-1, use 0-1-0-0-0, but without the dashes.

以下是PHP可以如何帮助我们的信息:

Here's how PHP can help us:

bindec('01000') == 8;
bindec('00001') == 1;
decbin(8) == '01000';
decbin(1) == '00001';

最后,这是实现:

// Setting a member's about_member property...
$about_member = '01100101';
$about_member_int = bindec($about_member);
$query = "INSERT INTO members (name,about_member) VALUES ($name,$about_member_int)";

// Getting matches...
$total_bits = 8; // The maximum length the member_about field can be (8 in this example)
$my_member_about = '00101100';
$my_member_about_int = bindec($my_member_about_int);
$query = "
    SELECT 
        *,
        (1 - (BIT_COUNT(member_about ^ $my_member_about_int) / $total_bits)) match 
    FROM members
    ORDER BY match DESC
    LIMIT 10";

最后一个查询将选择与我最相似的10个成员!

This last query will have selected the 10 members most similar to me!

现在,用外行的话来说一下,

我们使用二进制文件是因为它使事情变得简单.二进制数就像一长串的电灯开关.我们要保存电灯开关配置",并查找具有最相似配置的成员.

We use binary because it makes things easier; the binary number is like a long line of light switches. We want to save our "light switch configuration" as well as find members that have the most similar configurations.

^操作员(给定2种灯光开关配置)为我们进行了比较.结果又是一系列的开关.如果两个原始开关位于不同的位置,则开关为ON;如果两个原始开关位于相同的位置,则开关为OFF.

The ^ operator, given 2 light switch configurations, does a comparison for us. The result is again a series of switches; a switch will be ON if the 2 original switches were in different positions, and OFF if they were in the same position.

BIT_COUNT告诉我们ON有多少个开关-给我们计数了多少个不同的开关. YOUR_TOTAL_BITS是开关总数.

BIT_COUNT tells us how many switches are ON--giving us a count of how many switches were different. YOUR_TOTAL_BITS is the total number of switches.

但是二进制数字仍然只是数字...所以1和0的字符串实际上只代表一个数字,例如133或94.但是,如果我们使用十进制数字,则很难形象化我们的电灯开关配置".这就是PHP的decbinbindec出现的地方.

But binary numbers are still just numbers... and so a string of 1's and 0's really just represents a number like 133 or 94. But it's a lot harder to visualize our "light switch configuration" if we use decimal numbers. That's where PHP's decbin and bindec come in.

了解有关二进制数字系统的更多信息.

希望这会有所帮助!

这篇关于在表中查找相似的数字模式的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆