Spotting similarities and patterns within a string - Python

Problem description

This is the use case I'm trying to figure this out for.

I have a list of spam subscriptions to a service, and they are killing conversion rate and other usability studies.

The inserted emails look like the following:

rogerep_dyeepvu@hotmail.com

rogeram_ingramameb@hotmail.com

rogerew_jonesewct@hotmail.com

roger[...]_surname[...]@hotmail.com


What would you suggest for spotting these entries with an automated script? It feels a little more complicated than it looks.

Many thanks for your help!

Recommended answer

I don't think you can easily check for this. It is unlikely to be a simple string-matching problem that you can throw a regular expression at, because I would guess that the name 'Roger' was just an example and that any number of names can appear in that position. You could run one of the regular expressions supplied by the other posters, parameterised with every permutation of plausible first names and last names, but that will probably take somewhere between "too long" and "forever", and will flag up plenty of false positives.

Another approach, which fits the pattern you posted above, would be to take the last 4 letters of the username and score how plausible they are. Spotting characters that are random rather than arranged sensibly (for a given language) can be done by training a Markov chain on legitimate text, which then lets you calculate the probability of a given 4 letters appearing in that order in that language. For random letters, this probability will typically be far lower than for a legitimate name (although if there are special characters or digits in there, all bets are off).
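As a rough illustration of the Markov-chain idea, here is a minimal sketch of a first-order (letter-to-letter) model with add-one smoothing. The function names and the tiny `SAMPLE_NAMES` training list are made up for this example; in practice you would train on a much larger list of real names or dictionary words and pick a cut-off score by experiment.

```python
import math
from collections import defaultdict

ALPHABET = "abcdefghijklmnopqrstuvwxyz"

# Hypothetical, tiny training set; a real one should be far larger.
SAMPLE_NAMES = [
    "roger", "jones", "ingram", "smith", "johnson", "williams", "brown",
    "miller", "davis", "wilson", "anderson", "thomas", "jackson", "martin",
    "thompson",
]

def train_bigram_model(words):
    """Count letter-to-letter transitions; return a log-probability function."""
    counts = defaultdict(lambda: defaultdict(int))
    for word in words:
        word = word.lower()
        for a, b in zip(word, word[1:]):
            if a in ALPHABET and b in ALPHABET:
                counts[a][b] += 1

    def log_prob(a, b):
        # Add-one smoothing: unseen transitions get a small non-zero probability.
        # Digits and symbols were never counted, so they score as unseen.
        total = sum(counts[a].values()) + len(ALPHABET)
        return math.log((counts[a][b] + 1) / total)

    return log_prob

def mean_log_prob(text, log_prob):
    """Average transition log-probability; lower means less name-like."""
    text = text.lower()
    pairs = list(zip(text, text[1:]))
    if not pairs:
        return float("-inf")
    return sum(log_prob(a, b) for a, b in pairs) / len(pairs)

def username_tail(email, length=4):
    """Last `length` characters of the part before the @ sign."""
    return email.split("@", 1)[0][-length:]

log_prob = train_bigram_model(SAMPLE_NAMES)

for address in ["roger_jones@hotmail.com", "rogerew_jonesewct@hotmail.com"]:
    tail = username_tail(address)
    print(address, tail, round(mean_log_prob(tail, log_prob), 3))
```

With this toy training set the tail "ones" scores higher than "ewct", but the gap depends entirely on the training data, so treat any threshold as something to tune rather than a fixed rule.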

Another way might be to use a Bayesian filter (e.g. something like Reverend in Python, though there are others) trained on the last 4 letters of legitimate email addresses. This would probably spot 95% of the ones that are just random, provided you make the data usable, e.g. by submitting not just the 4 letters but also each of the 2-letter and 3-letter substrings inside them, to capture the context of each letter. I don't think this would work as well as the Markov-style method, though.
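To make the substring idea concrete, here is a small sketch of a naive Bayes classifier written from scratch. It is a stand-in and does not use or imitate Reverend's API. The class name, the labels, and the training tails are all invented for illustration, and it assumes you also have a pool of known-random tails to train against, which the approach above does not strictly require.

```python
import math
from collections import Counter

def substrings(tail, sizes=(2, 3, 4)):
    """Every contiguous substring of the given sizes, used as filter tokens."""
    return [tail[i:i + n] for n in sizes for i in range(len(tail) - n + 1)]

class TinyNaiveBayes:
    """Minimal multinomial naive Bayes over substring tokens (illustrative only)."""

    def __init__(self):
        self.token_counts = {}        # label -> Counter of tokens
        self.example_counts = Counter()  # label -> number of training examples

    def train(self, label, tail):
        tokens = substrings(tail.lower())
        self.token_counts.setdefault(label, Counter()).update(tokens)
        self.example_counts[label] += 1

    def guess(self, tail):
        tokens = substrings(tail.lower())
        vocab = set()
        for counts in self.token_counts.values():
            vocab.update(counts)
        total_examples = sum(self.example_counts.values())
        scores = {}
        for label, counts in self.token_counts.items():
            total_tokens = sum(counts.values())
            # Log prior plus add-one smoothed log likelihood of each token.
            score = math.log(self.example_counts[label] / total_examples)
            for token in tokens:
                score += math.log((counts[token] + 1) / (total_tokens + len(vocab)))
            scores[label] = score
        return max(scores, key=scores.get)

classifier = TinyNaiveBayes()
for tail in ["ones", "mith", "nson"]:      # hypothetical legitimate tails
    classifier.train("legit", tail)
for tail in ["epvu", "ameb", "ewct"]:      # tails from the spam examples above
    classifier.train("random", tail)

print(classifier.guess("sons"))
print(classifier.guess("ewct"))
```

Three examples per class is far too few to say anything about real accuracy; the point is only to show how the 2-, 3- and 4-letter substrings become the tokens the filter learns from.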

Whatever check you do, you can cut false positives by only submitting certain email addresses to it (e.g. only those at webmail addresses that contain an underscore, with at least 3 characters before the underscore and 5 characters after it).
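That pre-filter could look something like the sketch below. The `WEBMAIL_DOMAINS` set and the `is_candidate` name are assumptions made for the example; adjust the domain list to whatever providers actually show up in your data.

```python
import re

# Hypothetical list of webmail providers; extend as needed.
WEBMAIL_DOMAINS = {"hotmail.com", "gmail.com", "yahoo.com"}

# At least 3 characters before the first underscore, at least 5 after it.
CANDIDATE = re.compile(
    r"^(?P<before>[^_@]{3,})_(?P<after>[^@]{5,})@(?P<domain>[^@]+)$"
)

def is_candidate(email):
    """True if the address matches the shape worth passing to a costlier check."""
    match = CANDIDATE.match(email.strip().lower())
    return bool(match) and match.group("domain") in WEBMAIL_DOMAINS

for address in [
    "rogerep_dyeepvu@hotmail.com",
    "bob@hotmail.com",
    "ab_cdefgh@hotmail.com",
    "roger_jones@example.org",
]:
    print(address, is_candidate(address))
```

Only addresses for which `is_candidate` returns `True` would then go on to the more expensive randomness check.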

But ultimately, you can never know for sure whether an address is spam or real until it gets used for one purpose or the other. So, if possible, I'd suggest giving up on analysing the content and fixing the problem somewhere else. In what way are they killing the conversion rate? If you're counting these dummy accounts in some sort of metric, you'd be best off adding a verification stage first and only caring about metrics for accounts that pass verification. Some people really do have addresses like rogerep_dyeepvu@hotmail.com, after all.
