Pattern comparing with mysql between two tables column


Question


One simple question first: are preg_match in PHP and LIKE in a MySQL query the same?

Main Question:

Consider my two tables, table1 and table2:

Table 1                                                                       Table 2

+-------+-------------------------+      +-------+------------------------------+
| ID    | Model                   |      | ID    | Model                        |
+-------+-------------------------+      +-------+------------------------------+
| 1     | iPad 2 WiFi 16GB        |      | 1     | iPad2 WiFi 16GB              |
| 2     | iPhone 4S 16GB          |      | 2     | iPhone4S 16GB                |
| 3     | iPod Touch(4th Gen)8GB  |      | 3     |iPod Touch 4th Generation 8GB |
+-------+-------------------------+      +-------+------------------------------+

What I want to do is compare these two tables. As you can see, iPad 2 WiFi 16GB and iPad2 WiFi 16GB, or iPod Touch(4th Gen)8GB and iPod Touch 4th Generation 8GB, are the same product, but neither pair shows up if I put where Table1.model = Table2.model in my query, because they are not exact matches. I want to compare these rows in a MySQL query, using LIKE or any other way, so that rows in the two tables that are alike get matched. Kindly let me know how to write such an SQL query.

I tried the following SQL query, but it did not return all the rows; in particular, it did not return the kinds of rows mentioned in the example above.

SELECT table1.model as model1, table2.model as model2
FROM table1,table2 WHERE table1.model REGEXP table2.model 

Solution

Two questions - are the descriptions standard (descriptions don't change) or are they entered by a user? If they're standard, add a column that is an integer and do comparison on this column.

If it's entered by the user, your work is more complicated because you're looking for something closer to a fuzzy search. I used a bi-gram search algorithm to rank similarity between two strings, but this can't be done directly in MySQL.

In lieu of a fuzzy search, you could use LIKE, but its efficiency is limited: it falls back to table scans if you end up putting the '%' at the beginning of the search term. Also, LIKE only matches on the substring you choose, meaning you'd need to know the substring ahead of time.

I'd be happy to elaborate more once I know what you're trying to do.

EDIT1: Ok, given your elaboration, you will need to do a fuzzy style search as I mentioned. I use a bi-gram method, which involves taking each entry made by the user and splitting it into chunks of 2 or 3 characters. I then store each of these chunks in another table, with each entry keyed back to the actual description.

Example:

Description1: "A fast run forward"
Description2: "A short run forward"

If you break each into 2 char chunks - 'A ', ' f', 'fa', 'as','st'.....

Then you can compare the number of 2 char chunks that match both strings and get a "score" which will connote accuracy or similarity between the two.
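As a rough illustration of that scoring idea, here is a minimal PHP sketch. The function names chunks and bigram_score are hypothetical, and the formula (twice the shared chunks divided by the total number of chunks in both strings) is only one possible choice, not necessarily the one used in the answer. It assumes single-byte text.

```php
<?php
// Split a string into overlapping chunks of $len characters.
// Assumes single-byte text; mb_strtolower/mb_substr would be needed for UTF-8.
function chunks(string $text, int $len = 2): array
{
    $s = strtolower($text);
    $out = [];
    for ($i = 0; $i <= strlen($s) - $len; $i++) {
        $out[] = substr($s, $i, $len);
    }
    return $out;
}

// Hypothetical score: shared chunks (counted with multiplicity) relative to
// the total number of chunks in both strings, giving a value from 0.0 to 1.0.
function bigram_score(string $a, string $b, int $len = 2): float
{
    $ca = array_count_values(chunks($a, $len));
    $cb = array_count_values(chunks($b, $len));
    $shared = 0;
    foreach ($ca as $chunk => $n) {
        $shared += min($n, $cb[$chunk] ?? 0);
    }
    $total = array_sum($ca) + array_sum($cb);
    return $total === 0 ? 0.0 : 2 * $shared / $total;
}

// Two rows from the question that should be treated as alike, then two that should not.
printf("%.2f\n", bigram_score('iPad 2 WiFi 16GB', 'iPad2 WiFi 16GB'));
printf("%.2f\n", bigram_score('iPad 2 WiFi 16GB', 'iPhone 4S 16GB'));
```

The first pair scores noticeably higher than the second, which is the property a similarity threshold would rely on.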

Given I don't know what development language you're using, I'll leave the implementation out, but this will need to be done outside MySQL, not directly in it.

Or the lazy alternative would be to use a cloud search service like the one Amazon has, which will provide search based on terms you give it...not sure if they allow you to continuously add new descriptions to consider though, and depending on your application, it can be a bit costly (IMHO).

R

For another SO post on the bigram implementation - see this SO bigram / fuzzy search

--- Update per questioner elaboration---

First, I'm assuming you read the theory on the links I provided. Second, I'll try to keep it as DB agnostic as possible, since it doesn't need MySQL (though I use it, and it works more than fine).

Ok, so the bigram method works ok when making and comparing in-memory arrays only if the number of possible matches is relatively small; otherwise it fairly quickly suffers from table-scan performance, like a MySQL table without indexes. So, you're going to use the database's strengths to do the indexing for you.

What you need is one table to hold the user-entered "terms" or text that you're looking to compare. The simplest form is a table with two columns: one is a unique auto-increment integer which will be indexed, which we'll call hd_id below; the second is a varchar(255) if the strings are pretty short, or TEXT if they can get long - you can name this whatever you want.

Then, you'll need to make another table that has at least THREE columns - one for the reference column back to the other table's auto-incremented column (we'll call this hd_id below), the second would be a varchar() of say 5 chars at most (this will hold your bigram chunks) which we'll call "bigram" below, and the third an auto-incrementing column called b_id below. This table will hold all the bigrams for each user's entry and tie back to the overall entry. You'll want to index the varchar column by itself (or first in order in a compound index).

Now, every time a user enters a term you want to search, you need to enter the term in the first table, then dissect the term into bigrams and enter each chunk into the second table, using the reference back to the overall term in the first table to complete the relationship. This way, you're doing the dissection in PHP, but letting MySQL or whatever database do the index optimization for you. It may help in the bigram phase to store the number of bigrams made in table 1 for the calculation phase. Below is some code in PHP to give you an idea on how to create the bigrams:

// split the string into len-character segments and store separately in array slots
function get_bigrams($theString,$len)   
{
   $s=strtolower($theString);
   $v=array();
   $slength=strlen($s)-($len-1);     // we stop short of $len-1 so we don't make short chunks as we run out of characters

   for($m=0;$m<$slength;$m++)
   {
      $v[]=substr($s,$m,$len);
   }
   return $v;
}    

Don't worry about spaces in the strings - they're actually really helpful if you think about fuzzy search.
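The two tables and the storing step described above could look roughly like the following sketch. The column names hd_id, b_id, bigram and description come from the answer; the bigram_count column, the index name, the concrete column types and the helper names bigram_rows and store_entry are assumptions made for illustration. The schema uses MySQL syntax, and store_entry expects an already-connected PDO handle.

```php
<?php
// Hypothetical MySQL schema using the column names from the answer;
// bigram_count is an assumed extra column for the later scoring step.
const SCHEMA_SQL = <<<SQL
CREATE TABLE table1 (
    hd_id        INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    description  VARCHAR(255) NOT NULL,
    bigram_count INT UNSIGNED NOT NULL DEFAULT 0
);
CREATE TABLE table2 (
    b_id   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    hd_id  INT UNSIGNED NOT NULL,
    bigram VARCHAR(5) NOT NULL,
    INDEX idx_bigram (bigram, hd_id)
);
SQL;

// Build the (hd_id, bigram) rows for one entry; single-byte text assumed.
function bigram_rows(int $hdId, string $text, int $len = 2): array
{
    $s = strtolower($text);
    $rows = [];
    for ($i = 0; $i <= strlen($s) - $len; $i++) {
        $rows[] = [$hdId, substr($s, $i, $len)];
    }
    return $rows;
}

// Store one entry and all of its bigrams inside a single transaction.
// Error handling (rollback on failure) is left out of this sketch.
function store_entry(PDO $pdo, string $text, int $len = 2): int
{
    $pdo->beginTransaction();
    $pdo->prepare('INSERT INTO table1 (description) VALUES (?)')->execute([$text]);
    $hdId = (int) $pdo->lastInsertId();

    $rows = bigram_rows($hdId, $text, $len);
    $insert = $pdo->prepare('INSERT INTO table2 (hd_id, bigram) VALUES (?, ?)');
    foreach ($rows as $row) {
        $insert->execute($row);
    }

    $pdo->prepare('UPDATE table1 SET bigram_count = ? WHERE hd_id = ?')
        ->execute([count($rows), $hdId]);
    $pdo->commit();
    return $hdId;
}
```

Putting bigram first in the compound index follows the answer's advice to index the varchar column first, since lookups are driven by the bigram values.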

So you get the bigrams, enter them in a table, linked to the overall text in table 1 via an indexed column...now what?

Now whenever you search for a term such as "My favorite term to search for" - you can use the PHP function to turn it into an array of bigrams. You then use this to create the IN (..) part of a SQL statement on your bigram table (table 2). Below is an example:

select count(b_id) as matches, a.hd_id, description from table2 a
inner join table1 b on (a.hd_id = b.hd_id)
where bigram in (" . $sqlstr . ")
group by a.hd_id, description order by matches desc limit X

I've left $sqlstr as a PHP string reference - you could construct it yourself as a comma-separated list of quoted, escaped values from the array returned by get_bigrams, using implode or similar, or parameterize the query if you prefer.
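A parameterized variant of that query might be built as in the sketch below. The function name build_match_query is hypothetical, and the table and column names follow the query above. The sketch only builds the SQL text and the list of values to bind; running it needs a PDO connection, which is not created here.

```php
<?php
// Build the ranking query with one "?" placeholder per bigram, so the values
// are bound by the driver instead of being concatenated into the SQL string.
function build_match_query(array $bigrams, int $limit = 10): array
{
    // Duplicates add nothing inside IN (...), so drop them.
    $bigrams = array_values(array_unique($bigrams));
    if ($bigrams === []) {
        throw new InvalidArgumentException('search term is too short to produce bigrams');
    }
    $placeholders = implode(', ', array_fill(0, count($bigrams), '?'));
    $sql = "SELECT COUNT(a.b_id) AS matches, a.hd_id, b.description
            FROM table2 a
            INNER JOIN table1 b ON a.hd_id = b.hd_id
            WHERE a.bigram IN ($placeholders)
            GROUP BY a.hd_id, b.description
            ORDER BY matches DESC
            LIMIT " . $limit;
    return [$sql, $bigrams];
}

[$sql, $params] = build_match_query(['ip', 'pa', 'ad', 'd2'], 5);
// With a PDO connection in $pdo (not created here):
//   $stmt = $pdo->prepare($sql);
//   $stmt->execute($params);
//   $rows = $stmt->fetchAll(PDO::FETCH_ASSOC);
```

Binding the bigrams as parameters avoids having to quote and escape chunks that contain apostrophes or backslashes.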

If done correctly, the query above returns the most closely matched fuzzy search terms depending on the length of the bigram you chose. The length you choose has a relative efficacy based on your expected length of the overall search strings.

Lastly - the query above just gives a fuzzy match rank. You can play around with it and enhance it by comparing not just matches, but matches vs. overall bigram count, which will help de-bias long search strings compared to short strings. I've stopped here because at this juncture it becomes much more application specific.
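One way to fold the overall bigram count into the ranking is sketched below. It assumes each result row carries the match count plus a stored bigram_count for the entry (an assumed column; the answer only suggests storing the number of bigrams in table 1), and the choice of dividing by the larger of the two counts is hypothetical.

```php
<?php
// Re-rank rows from the match query. Each row is assumed to carry "matches"
// and the entry's stored "bigram_count"; $queryCount is the number of bigrams
// in the search term. Dividing by the larger count is one hypothetical choice.
function rerank(array $rows, int $queryCount): array
{
    foreach ($rows as &$row) {
        $denominator = max((int) $row['bigram_count'], $queryCount, 1);
        $row['score'] = $row['matches'] / $denominator;
    }
    unset($row); // break the reference left behind by foreach

    // Highest score first.
    usort($rows, fn(array $x, array $y) => $y['score'] <=> $x['score']);
    return $rows;
}

$ranked = rerank([
    ['hd_id' => 1, 'matches' => 12, 'bigram_count' => 60],
    ['hd_id' => 2, 'matches' => 11, 'bigram_count' => 14],
], 14);
echo $ranked[0]['hd_id'], "\n";
```

In this made-up data the long entry has one more raw match, but the short entry covers far more of its own bigrams, so it ranks first after normalization.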

Hope this helps!

R
