Detecting spammers with MySQL
Question
I see an ever-increasing number of users signing up on my site just to send duplicate spam messages to other users. I've added some server-side code to detect duplicate messages with the following MySQL query:
SELECT count(content) as msgs_sent
FROM messages
WHERE sender_id = '.$sender_id.'
GROUP BY content having count(content) > 10
The query works well, but now they're getting around it by changing a few characters in their messages. Is there a way to detect this with MySQL, or do I need to look at each grouping returned from MySQL and then use PHP to determine the percentage of similarity?
Any thoughts or suggestions?
Recommended answer
You could look at implementing full-text matching, similar to the MATCH example in the MySQL documentation:
mysql> SELECT id, body, MATCH (title,body) AGAINST
-> ('Security implications of running MySQL as root') AS score
-> FROM articles WHERE MATCH (title,body) AGAINST
-> ('Security implications of running MySQL as root');
+----+-------------------------------------+-----------------+
| id | body                                | score           |
+----+-------------------------------------+-----------------+
|  4 | 1. Never run mysqld as root. 2. ... | 1.5219271183014 |
|  6 | When configured properly, MySQL ... | 1.3114095926285 |
+----+-------------------------------------+-----------------+
2 rows in set (0.00 sec)
So, for your example:
SELECT id, MATCH (content) AGAINST ('your string') AS score
FROM messages
WHERE MATCH (content) AGAINST ('your string')
-- HAVING rather than AND: a column alias such as score cannot be referenced in WHERE
HAVING score > 1;
Note that to use these functions, your content column needs to have a FULLTEXT index.
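As a minimal sketch of that setup, assuming the messages table and the content and sender_id columns from the question, the index could be created and then used to count one sender's look-alike messages. The index name ft_content and the sender id 123 are hypothetical placeholders, the threshold of 1 is simply carried over from the query above and would need tuning against real data, and InnoDB tables support FULLTEXT indexes only from MySQL 5.6 onward.

```sql
-- Hypothetical index name; a FULLTEXT index on content is required
-- before MATCH ... AGAINST can be used on that column.
ALTER TABLE messages ADD FULLTEXT INDEX ft_content (content);

-- Count one sender's messages that resemble a suspect message.
-- 123 and 'your string' are placeholders; bind them as parameters
-- in application code instead of concatenating them into the query.
SELECT COUNT(*) AS similar_msgs
FROM messages
WHERE sender_id = 123
  AND MATCH (content) AGAINST ('your string') > 1;
```

If similar_msgs exceeds a limit such as the 10 used in the original duplicate check, the sender could be flagged for review.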
What is score in this example?
It is a relevance value, computed through the process described below:
Every correct word in the collection and in the query is weighted according to its significance in the collection or query. Consequently, a word that is present in many documents has a lower weight (and may even have a zero weight), because it has lower semantic value in this particular collection. Conversely, if the word is rare, it receives a higher weight. The weights of the words are combined to compute the relevance of the row.
From the documentation page.