在表中查找相似的数字模式 [英] Finding similar number patterns in table
问题描述
好的,假设我们有members
表.假设有一个名为about_member
的字段.每个人都会有这样的字符串1-1-2-1-2
.假设member_1拥有此字符串1-1-2-2-1
,并且他搜索谁拥有相似的字符串或尽可能相似的字符串.例如,如果member_2的字符串为1-1-2-2-1
,则将100%匹配,但是如果member_3的字符串为2-1-1-2-1
,则将为60%匹配.而且必须按匹配百分比排序.使用MYSQL和PHP的最佳方法是什么?真的很难解释我的意思,但也许您明白了,如果没有,问我.谢谢.
Ok, let's suppose we have members
table. There is a field called, let's say, about_member
. There will be a string like this 1-1-2-1-2
for everybody. Let's suppose member_1 has this string 1-1-2-2-1
and he searches who has the similar string or as much similar as possible. For example if member_2 has string 1-1-2-2-1
it will be 100% match, but if member_3 has string like this 2-1-1-2-1
it will be 60% match. And it has to be ordered by match percent. What is the most optimal way to do it with MYSQL and PHP? It's really hard to explain what I mean, but maybe you got it, if not, ask me. Thanks.
请给我一些没有Levenshtein方法的想法.该答案将得到赏金.谢谢. (我们将在有机会时宣布赏金)
推荐答案
Jawa最初发布了这个想法;这是我的尝试.
Jawa posted this idea originally; here is my attempt.
^
是XOR函数.它逐位比较2个二进制数,如果两位相同则返回0,否则返回1.
^
is the XOR function. It compares 2 binary numbers bit-by-bit and returns 0 if both bits are the same, and 1 otherwise.
0 1 0 0 0 1 0 1 0 1 1 1 (number 1)
^ 0 1 1 1 0 1 0 1 1 0 1 1 (number 2)
= 0 0 1 1 0 0 0 0 1 1 0 0 (result)
这如何解决您的问题:
// In binary...
1111 ^ 0111 = 1000 // (1 bit out of 4 didn't match: 75% match)
1111 ^ 0000 = 1111 // (4 bits out of 4 didn't match: 0% match)
// The same examples, except now in decimal...
15 ^ 7 = 8 (1000 in binary) // (1 bit out of 4 didn't match: 75% match)
15 ^ 0 = 15 (1111 in binary) // (4 bits out of 4 didn't match: 0% match)
我们如何在MySQL中计算这些位数:
How we can count these bits in MySQL:
BIT_COUNT(b'0111') = 3 // Bit count of binary '0111'
BIT_COUNT(7) = 3 // Bit count of decimal 7 (= 0111 in binary)
BIT_COUNT(b'1111' ^ b'0111') = 1 // (1 bit out of 4 didn't match: 75% match)
因此要获得相似性 ...
// First we focus on calculating mismatch.
(BIT_COUNT(b'1111' ^ b'0111') / YOUR_TOTAL_BITS) = 0.25 (25% mismatch)
(BIT_COUNT(b'1111' ^ b'1111') / YOUR_TOTAL_BITS) = 0 (0% mismatch; 100% match)
// Now, getting the proportion of matched bits is easy
1 - (BIT_COUNT(b'1111' ^ b'0111') / YOUR_TOTAL_BITS) = 0.75 (75% match)
1 - (BIT_COUNT(b'1111' ^ b'1111') / YOUR_TOTAL_BITS) = 1.00 (100% match)
如果我们仅使您的about_member
字段将数据存储为位(并由整数表示),那么我们可以轻松地完成所有这些操作!代替1-2-1-1-1
,使用0-1-0-0-0
,但不要使用破折号.
If we could just make your about_member
field store data as bits (and be represented by an integer), we could do all of this easily! Instead of 1-2-1-1-1
, use 0-1-0-0-0
, but without the dashes.
以下是PHP可以如何帮助我们的信息:
Here's how PHP can help us:
bindec('01000') == 8;
bindec('00001') == 1;
decbin(8) == '01000';
decbin(1) == '00001';
最后,这是实现:
// Setting a member's about_member property...
$about_member = '01100101';
$about_member_int = bindec($about_member);
$query = "INSERT INTO members (name,about_member) VALUES ($name,$about_member_int)";
// Getting matches...
$total_bits = 8; // The maximum length the member_about field can be (8 in this example)
$my_member_about = '00101100';
$my_member_about_int = bindec($my_member_about_int);
$query = "
SELECT
*,
(1 - (BIT_COUNT(member_about ^ $my_member_about_int) / $total_bits)) match
FROM members
ORDER BY match DESC
LIMIT 10";
最后一个查询将选择与我最相似的10个成员!
This last query will have selected the 10 members most similar to me!
现在,用外行的话来说一下,
我们使用二进制文件是因为它使事情变得简单.二进制数就像一长串的电灯开关.我们要保存电灯开关配置",并查找具有最相似配置的成员.
We use binary because it makes things easier; the binary number is like a long line of light switches. We want to save our "light switch configuration" as well as find members that have the most similar configurations.
^
操作员(给定2种灯光开关配置)为我们进行了比较.结果又是一系列的开关.如果两个原始开关位于不同的位置,则开关为ON
;如果两个原始开关位于相同的位置,则开关为OFF
.
The ^
operator, given 2 light switch configurations, does a comparison for us. The result is again a series of switches; a switch will be ON
if the 2 original switches were in different positions, and OFF
if they were in the same position.
BIT_COUNT
告诉我们ON
有多少个开关-给我们计数了多少个不同的开关. YOUR_TOTAL_BITS
是开关总数.
BIT_COUNT
tells us how many switches are ON
--giving us a count of how many switches were different. YOUR_TOTAL_BITS
is the total number of switches.
但是二进制数字仍然只是数字...所以1和0的字符串实际上只代表一个数字,例如133或94.但是,如果我们使用十进制数字,则很难形象化我们的电灯开关配置".这就是PHP的decbin
和bindec
出现的地方.
But binary numbers are still just numbers... and so a string of 1's and 0's really just represents a number like 133 or 94. But it's a lot harder to visualize our "light switch configuration" if we use decimal numbers. That's where PHP's decbin
and bindec
come in.
希望这会有所帮助!
这篇关于在表中查找相似的数字模式的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!