What is the best way to check for duplicate TEXT fields in MySQL/PHP?
Question
My code pulls ~1000 HTML files, extracts the relevant information, and then stores that information in a MySQL TEXT field (as it is usually quite long). I am looking for a system to prevent duplicate entries in the DB.
My first idea is to add a hash column to the table (probably MD5), pull the list of hashes at the beginning of each run, and check for duplicates before inserting into the DB.
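A minimal sketch of this first idea, written in Python for illustration (in PHP the built-in md5() function would play the same role). The names content_hash, filter_new and known_hashes are hypothetical, and the set stands in for the list of hashes pulled from the table at the start of a run.

```python
import hashlib


def content_hash(text: str) -> str:
    """Return the MD5 hex digest of the text, encoded as UTF-8."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def filter_new(documents, known_hashes):
    """Yield (hash, text) pairs for documents whose hash is not yet known."""
    for text in documents:
        digest = content_hash(text)
        if digest in known_hashes:
            continue  # already stored, or repeated earlier in this run
        known_hashes.add(digest)
        yield digest, text


# Assumption: known_hashes was loaded from the hash column once per run.
known_hashes = {content_hash("already stored page")}
batch = ["already stored page", "new page", "new page"]
new_rows = list(filter_new(batch, known_hashes))
```

Only the pairs in new_rows would then be inserted. Adding each new hash to the set as it is seen also catches duplicates inside a single batch.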
My second idea is to store the file length (in bytes, characters, or whatever), index that column, and check for duplicate lengths, double-checking the content only when a duplicate length is found.
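The second idea could look roughly like the sketch below. It is an in-memory stand-in written in Python: the LengthIndex class is hypothetical, and in the real setup the lookup would be a query on the indexed length column followed by a comparison of the TEXT content.

```python
from collections import defaultdict


class LengthIndex:
    """Group stored texts by byte length; compare content only on a length match."""

    def __init__(self):
        self._by_length = defaultdict(list)

    def add_if_new(self, text: str) -> bool:
        """Store the text and return True, or return False if it is a duplicate."""
        candidates = self._by_length[len(text.encode("utf-8"))]
        if text in candidates:  # full comparison, only against same-length texts
            return False
        candidates.append(text)
        return True


index = LengthIndex()
first = index.add_if_new("abc")        # new length, stored
same_length = index.add_if_new("abd")  # same length, different content, stored
repeat = index.add_if_new("abc")       # same length and content, rejected
```

Equal lengths are far more common than equal MD5 hashes, so this approach ends up doing more full-content comparisons than the hash approach.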
No idea what is the best solution performance-wise. Perhaps there is a better way?
If there is an efficient way to check if files are >95% similar that would be ideal, but I doubt there is?
Thanks for any help!
BTW I am using PHP5/Kohana
EDIT:
I just had an idea for checking similarity: I could count all alphanumeric characters and log the number of occurrences of each,
e.g.: 17aB... = 1a,7b,10c,27c,...
A potential problem would be the upper limit for a char count (around 61?).
I imagine false positives would still be rare...
Good idea/bad idea?
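To make the character-count idea concrete, here is a rough Python sketch. The question does not say how two count signatures would be compared, so the overlap ratio used here (shared counts divided by the larger counts) is an assumption, and the function names are hypothetical.

```python
from collections import Counter


def char_signature(text: str) -> Counter:
    """Count occurrences of each ASCII alphanumeric character in the text."""
    return Counter(ch for ch in text if ch.isascii() and ch.isalnum())


def signature_similarity(a: str, b: str) -> float:
    """Ratio of shared character counts to the larger counts (1.0 means identical)."""
    sig_a, sig_b = char_signature(a), char_signature(b)
    shared = sum((sig_a & sig_b).values())  # per-character minimum
    total = sum((sig_a | sig_b).values())   # per-character maximum
    return shared / total if total else 1.0


# Texts with the same characters in a different order get identical signatures.
anagram_score = signature_similarity("listen", "silent")
disjoint_score = signature_similarity("abc", "xyz")
```

The anagram case shows where false positives come from: the signature ignores character order, so two different texts can score 1.0.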
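Answer
The hash idea is probably the best. You might have collisions, but they would be exceedingly rare.
Make the hash field a unique key for the table, and catch the duplicate-key error code (1062 in MySQL). Or use INSERT IGNORE, which silently skips the duplicate row, or REPLACE, which deletes the existing row and inserts the new one.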
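A small runnable sketch of the unique-key approach. To stay self-contained it uses Python's built-in sqlite3 module as a stand-in for MySQL, so the statement is SQLite's INSERT OR IGNORE rather than MySQL's INSERT IGNORE. The pages table, its columns, and the store function are hypothetical names.

```python
import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pages ("
    " id INTEGER PRIMARY KEY,"
    " content_hash TEXT NOT NULL UNIQUE,"  # the unique key on the hash
    " content TEXT NOT NULL)"
)


def store(conn: sqlite3.Connection, text: str) -> bool:
    """Insert the text unless its hash already exists; return True if inserted."""
    digest = hashlib.md5(text.encode("utf-8")).hexdigest()
    cur = conn.execute(
        "INSERT OR IGNORE INTO pages (content_hash, content) VALUES (?, ?)",
        (digest, text),
    )
    return cur.rowcount == 1  # 0 rows changed means the duplicate was skipped


results = [store(conn, "page one"), store(conn, "page two"), store(conn, "page one")]
row_count = conn.execute("SELECT COUNT(*) FROM pages").fetchone()[0]
```

With the constraint in the database, there is no need to pull the hash list at the start of each run: the insert itself reports whether the row was new.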