Is there any way to detect strings like putjbtghguhjjjanika?

Question

People search on my website, and some of the searches look like this:

tapoktrpasawe
qweasd qwa as
aıe qwo ıak kqw
qwe qwe qwe a

My question: is there any way to detect strings similar to the ones above?

I suppose it is impossible to detect 100% of them, but any solution is welcome :)

Edit: I mean "gibberish searches". For example, some people search for strings like "asdqweasdqw", "paykaprkg", or "iwepr wepr ow" in my search engine, and I want to detect these gibberish searches.

It doesn't matter whether a search returns 0 results or anything else; I can't use that logic.

If I only accept "regular words", some new brands or products will be ignored.

Thanks for your help.

Recommended answer

You could build a model of character-to-character transitions from a large body of English text. For example, you find out how common it is for an 'h' to follow a 't' (pretty common). In English, you expect a 'u' after a 'q'; a 'q' followed by anything else has very low probability, so it should be pretty alarming. Normalize the counts in your table so that each row gives probabilities. Then, for a query, walk through the matrix and compute the product of the transition probabilities you take, and normalize by the length of the query. When the number is low, you likely have a gibberish query (or something in a different language).
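
The idea above can be sketched in a few lines of Python. This is only an illustration under some assumptions, not the code from the repositories linked in this answer: the 27-character alphabet, the add-one smoothing, and the names `train` and `score` are all choices made for this example. The product of transition probabilities is computed as a sum of logs and turned into a geometric mean, which is one way to "normalize by the length of the query".

```python
import math

# Assumed alphabet: lowercase letters plus space; everything else is dropped.
ALPHABET = "abcdefghijklmnopqrstuvwxyz "
INDEX = {ch: i for i, ch in enumerate(ALPHABET)}


def normalize(text):
    """Lowercase the text and keep only characters in ALPHABET."""
    return [ch for ch in text.lower() if ch in INDEX]


def train(lines, smoothing=1.0):
    """Count character transitions and return a matrix of log P(next | current)."""
    n = len(ALPHABET)
    # Start every count at `smoothing` so unseen transitions are rare, not impossible.
    counts = [[smoothing] * n for _ in range(n)]
    for line in lines:
        chars = normalize(line)
        for a, b in zip(chars, chars[1:]):
            counts[INDEX[a]][INDEX[b]] += 1
    log_probs = []
    for row in counts:
        total = sum(row)  # normalize each row so it sums to 1
        log_probs.append([math.log(c / total) for c in row])
    return log_probs


def score(query, log_probs):
    """Geometric mean of the transition probabilities along the query."""
    chars = normalize(query)
    pairs = list(zip(chars, chars[1:]))
    if not pairs:
        return 0.0
    total = sum(log_probs[INDEX[a]][INDEX[b]] for a, b in pairs)
    return math.exp(total / len(pairs))


# Toy corpus for demonstration only; a real model needs far more text.
corpus = [
    "the quick brown fox jumps over the lazy dog",
    "this is the thing that they think about",
    "there is nothing other than this",
]
model = train(corpus)
print(score("the thing", model), score("qzxjkvw", model))
```

A query would then be flagged as gibberish when its score falls below a threshold. The threshold is not given here; it would have to be tuned on examples of good and bad queries.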

If you have a bunch of query logs, you might first build a model of general English text and then heavily weight your own queries in the training phase.
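
One possible way to do that weighting, shown as a hedged sketch: each training source contributes to the transition counts with its own weight, so transitions seen in your query log (for example, in a new brand name) count for more than transitions seen in general text. The function name `train_weighted`, the weight of 10, and the sample queries are assumptions made for this example.

```python
import math

# Assumed alphabet: lowercase letters plus space; everything else is dropped.
ALPHABET = "abcdefghijklmnopqrstuvwxyz "
INDEX = {ch: i for i, ch in enumerate(ALPHABET)}


def train_weighted(sources, smoothing=1.0):
    """Build log P(next | current) from (lines, weight) pairs.

    Each transition adds `weight` to its count instead of 1, so heavily
    weighted sources dominate the resulting probabilities.
    """
    n = len(ALPHABET)
    counts = [[smoothing] * n for _ in range(n)]
    for lines, weight in sources:
        for line in lines:
            chars = [ch for ch in line.lower() if ch in INDEX]
            for a, b in zip(chars, chars[1:]):
                counts[INDEX[a]][INDEX[b]] += weight
    log_probs = []
    for row in counts:
        total = sum(row)
        log_probs.append([math.log(c / total) for c in row])
    return log_probs


general_text = ["the quick brown fox jumps over the lazy dog"]
query_log = ["zyxel router", "zyxel modem"]  # hypothetical brand queries

plain = train_weighted([(general_text, 1.0), (query_log, 1.0)])
boosted = train_weighted([(general_text, 1.0), (query_log, 10.0)])

# The 'z' -> 'y' transition becomes more probable once the query log is boosted.
z, y = INDEX["z"], INDEX["y"]
print(math.exp(plain[z][y]), math.exp(boosted[z][y]))
```

With this kind of weighting, unusual but legitimate terms that appear often in your own logs are less likely to be scored as gibberish.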

For background, read about Markov chains.

Edit: I implemented this in Python here:

https://github.com/rrenaud/Gibberish-Detector

And buggedcom rewrote it in PHP:

https://github.com/buggedcom/Gibberish-Detector-PHP

my name is rob and i like to hack True
is this thing working? True
i hope so True
t2 chhsdfitoixcv False
ytjkacvzw False
yutthasxcvqer False
seems okay True
yay! True
