What is the best way to check for duplicate TEXT fields in MySQL/PHP?

Views: 152 | Posted: 2017/7/21 1:07:49 | Tags: php mysql hash duplicates

Question

My code pulls ~1000 HTML files, extracts the relevant information, and then stores that information in a MySQL TEXT field (as it is usually quite long). I am looking for a system to prevent duplicate entries in the DB.

My first idea is to add a HASH field to the table (probably MD5), pull the list of hashes at the beginning of each run, and check for duplicates before inserting into the DB.

My second idea is to store the file length (bytes, chars, or whatever), index that, and check for duplicate file lengths, double-checking the content if a duplicate length is found.

No idea what is the best solution performance-wise. Perhaps there is a better way?

If there is an efficient way to check whether files are >95% similar, that would be ideal, but I doubt there is one?

Thanks for any help!

BTW, I am using PHP5/Kohana.

EDIT:

Just had an idea on checking for similarity: I could count all alphanumeric characters and log the number of occurrences of each,

e.g.: 17aB... = 1a,7b,10c,27c,...

A potential problem would be the upper limit for a char count (around 61?).

I imagine false positives would still be rare...

Good idea/bad idea?
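
For illustration only, the counting idea from the EDIT could be sketched roughly as below. The function names and the similarity measure (shared character counts divided by the larger total) are assumptions chosen for this sketch and do not come from the original post. Note that there are 62 distinct ASCII letters and digits, and that a histogram ignores character order, so reordered text scores as identical.

```php
<?php
/**
 * Count occurrences of each ASCII letter and digit in $text.
 * Returns an array of byte value => count (hypothetical helper).
 */
function alnumHistogram($text)
{
    $hist = array();
    // count_chars(..., 1) lists only the bytes that actually occur
    foreach (count_chars($text, 1) as $byte => $count) {
        $isDigit = $byte >= 48 && $byte <= 57;   // 0-9
        $isUpper = $byte >= 65 && $byte <= 90;   // A-Z
        $isLower = $byte >= 97 && $byte <= 122;  // a-z
        if ($isDigit || $isUpper || $isLower) {
            $hist[$byte] = $count;
        }
    }
    return $hist;
}

/**
 * Similarity in the range 0.0 to 1.0: shared counts / larger total.
 * This measure is an assumption made for the sketch.
 */
function histogramSimilarity(array $a, array $b)
{
    $shared = 0;
    foreach ($a as $byte => $count) {
        if (isset($b[$byte])) {
            $shared += min($count, $b[$byte]);
        }
    }
    $larger = max(array_sum($a), array_sum($b));
    if ($larger === 0) {
        return 1.0; // two texts with no alphanumerics count as identical
    }
    return (float) $shared / $larger;
}
```

With this sketch, a pair of texts would be flagged as near-duplicates when `histogramSimilarity(...)` returns 0.95 or more; any pair flagged this way would still need a real content comparison before being discarded.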

Solution

The hash idea is probably the best. You might have collisions, but they would be exceedingly rare.
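
As a rough illustration of the hash idea, here is a minimal sketch in PHP. The function name `contentHash` and the whitespace normalisation step are assumptions added for illustration, not part of the original answer; drop the normalisation if a byte-exact comparison is wanted.

```php
<?php
/**
 * Return an MD5 fingerprint of the extracted text (hypothetical helper).
 */
function contentHash($text)
{
    // Assumption: differences in whitespace alone should not make two
    // entries distinct, so collapse whitespace runs and trim the ends.
    $normalized = trim(preg_replace('/\s+/', ' ', $text));

    // md5() returns a 32-character lowercase hexadecimal string
    return md5($normalized);
}
```

Storing the 32-character result in its own column means duplicates are detected by comparing a short fixed-length value rather than the long TEXT field itself.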

Make the hash field a unique key for the table and catch the duplicate-key error code (MySQL error 1062). Alternatively, use INSERT IGNORE or REPLACE.
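
The unique-key approach might look something like the sketch below. It assumes PDO rather than Kohana's own database layer, and the table name `pages`, its columns, and the function name `insertIfNew` are all hypothetical.

```php
<?php
/*
 * Assumed schema (table and column names are hypothetical):
 *
 *   CREATE TABLE pages (
 *       id           INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
 *       content      TEXT NOT NULL,
 *       content_hash CHAR(32) NOT NULL,
 *       UNIQUE KEY uniq_content_hash (content_hash)
 *   );
 */

/**
 * Insert $content unless a row with the same MD5 hash already exists.
 * Returns true if a row was inserted, false if it was skipped.
 */
function insertIfNew(PDO $pdo, $content)
{
    $stmt = $pdo->prepare(
        'INSERT IGNORE INTO pages (content, content_hash) VALUES (:content, :hash)'
    );
    $stmt->execute(array(':content' => $content, ':hash' => md5($content)));

    // INSERT IGNORE reports 0 affected rows when the unique key rejects the row
    return $stmt->rowCount() === 1;
}

// Example usage (connection details are placeholders):
// $pdo = new PDO('mysql:host=localhost;dbname=example;charset=utf8', 'user', 'secret');
// $inserted = insertIfNew($pdo, $extractedText);
```

Because the database enforces uniqueness, there is no need to load the full hash list at the start of each run. If a plain INSERT is used instead, with PDO set to PDO::ERRMODE_EXCEPTION, a duplicate should surface as a PDOException whose `errorInfo[1]` is 1062. Keep in mind that INSERT IGNORE also downgrades some other errors to warnings, so catching the specific error code is the stricter option.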
