Detecting a (naughty or nice) URL or link in a text string

Problem description

How can I detect (with regular expressions or heuristics) a web site link in a string of text, such as a comment?

The purpose is to prevent spam. HTML is stripped so I need to detect invitations to copy-and-paste. It should not be economical for a spammer to post links because most users could not successfully get to the page. I would like suggestions, references, or discussion on best-practices.

Some goals:

  • The low-hanging fruit like well-formed URLs (http://some-fqdn/some/valid/path.ext)
  • URLs but without the http:// prefix (i.e. a valid FQDN + valid HTTP path)
  • Any other funny business

Of course, I am blocking spam, but the same process could be used to auto-link text.

Here are some things I'm thinking.

  • The content is native-language prose, so I can be trigger-happy in detection
  • Should I strip out all whitespace first, to catch "www .example.com"? Would common users know to remove the space themselves, or do any browsers "do-what-I-mean" and strip it for you?
  • Maybe multiple passes is a better strategy, with scans for:
    • Well-formed URLs
    • All non-whitespace followed by '.' followed by any valid TLD
    • Anything else?
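
The multi-pass idea above could be sketched roughly as follows. This is only an illustration, not a vetted filter: the `TLDS` tuple is a deliberately tiny subset, and the names `PASSES` and `find_suspicious` are made up for this example. The third pass is one possible answer to "anything else?", tolerating whitespace around the dot.

```python
import re

# Illustrative subset only (assumption); a real list of TLDs is much longer.
TLDS = ("com", "net", "org", "info", "biz")
TLD_GROUP = "(?:" + "|".join(TLDS) + ")"

# Pass 1: well-formed URLs with an explicit scheme.
WELL_FORMED = re.compile(r"https?://\S+", re.IGNORECASE)

# Pass 2: non-whitespace, then '.', then a known TLD ending at a word boundary.
BARE_DOMAIN = re.compile(r"\S+\." + TLD_GROUP + r"\b", re.IGNORECASE)

# Pass 3: same idea, but allowing whitespace on either side of the dot.
SPACED_DOT = re.compile(r"\w\s*\.\s*" + TLD_GROUP + r"\b", re.IGNORECASE)

PASSES = [
    ("well-formed", WELL_FORMED),
    ("bare-domain", BARE_DOMAIN),
    ("spaced-dot", SPACED_DOT),
]


def find_suspicious(text):
    """Return the name of every pass whose pattern occurs in the text."""
    return [name for name, pattern in PASSES if pattern.search(text)]
```

Running the passes separately, rather than merging them into one pattern, makes it possible to treat a hit from the looser passes more leniently than a hit from the strict one.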

I've read these and they are now documented here, so you can just reference the regexes in those questions if you want.

  • replace URL with HTML Links javascript
  • What is the best regular expression to check if a string is a valid URL
  • Getting parts of a URL (Regex)

Wow, there are some very good heuristics listed in here! For me, the best bang-for-the-buck is a synthesis of the following:

  1. @Jon Bright's technique of detecting TLDs (a good defensive chokepoint)
  2. For those suspicious strings, replace the dot with a dot-looking character as per @capar
  3. A good dot-looking character is @Sharkey's subscripted &middot; (i.e. "·"). &middot; is also a word boundary, so it's harder to casually copy & paste.
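
A minimal sketch of steps 1 and 2, assuming a small illustrative TLD subset and a hypothetical helper named `defang`: only a dot that sits between a word character and a known TLD is swapped for a middle dot (U+00B7), so ordinary sentence punctuation is left alone.

```python
import re

# Illustrative subset only (assumption); a real list of TLDs is much longer.
TLDS = ("com", "net", "org", "info", "biz")

# A dot preceded by a word character and followed by a known TLD.
# Lookarounds keep the neighbouring characters out of the replacement.
SUSPICIOUS_DOT = re.compile(
    r"(?<=\w)\.(?=(?:" + "|".join(TLDS) + r")\b)", re.IGNORECASE
)

MIDDLE_DOT = "\u00b7"  # the character that &middot; renders as


def defang(text):
    """Replace the dot in front of a TLD with a look-alike middle dot."""
    return SUSPICIOUS_DOT.sub(MIDDLE_DOT, text)
```

For example, `defang("go to example.com now")` gives `"go to example·com now"`, which no longer pastes into an address bar as a working host name.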

That should make a spammer's CPM low enough for my needs; the "flag as inappropriate" user feedback should catch anything else. Other solutions listed are also very useful:

  • Stripping out all dotted-quads (@Sharkey's comment on his own answer)
  • @Sporkmonger's requirement for client-side JavaScript that inserts a required hidden field into the form
  • Pinging the URL server-side to establish whether it is a web site. (Perhaps I could run the HTML through SpamAssassin or another Bayesian filter as per @Nathan.)
  • Looking at the source for Chrome's smart address bar to see what clever tricks Google uses
  • Calling out to OWASP AntiSamy or other web services for spam/malware detection
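
Of these, the dotted-quad idea is the easiest to illustrate. The sketch below is an assumption about how it might look, with a hypothetical `strip_dotted_quads` helper; it does not check that each octet is at most 255, so it errs on the trigger-happy side.

```python
import re

# Four groups of 1-3 digits separated by dots, e.g. 192.168.0.1.
# Octet ranges are not validated (assumption: over-matching is acceptable here).
DOTTED_QUAD = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def strip_dotted_quads(text, placeholder="[removed]"):
    """Replace anything that looks like an IPv4 address with a placeholder."""
    return DOTTED_QUAD.sub(placeholder, text)
```

A three-part version number such as 1.2.3 is left untouched, because the pattern requires four numeric groups.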

Recommended answer

I'm concentrating my answer on trying to avoid spammers. This leads to two sub-assumptions: the people using the system will be actively trying to contravene your check, and your goal is only to detect the presence of a URL, not to extract the complete URL. This solution would look different if your goal is something else.

I think your best bet is going to be with the TLD. There are the two-letter ccTLDs and the (currently) comparatively small list of others. These need to be prefixed by a dot and suffixed by either a slash or some word boundary. As others have noted, this isn't going to be perfect. There's no way to get "buyfunkypharmaceuticals . it" without disallowing the legitimate "I tried again. it doesn't work" or similar. All of that said, this would be my suggestion:

    \w\.([a-zA-Z]{2}|aero|asia|biz|cat|com|coop|edu|gov|info|int|jobs|mil|mobi|museum|name|net|org|pro|tel|travel)(?:/|\b)

This will get:

  • buyfunkypharmaceuticals.it
  • google.com
  • http://stackoverflo**w.com/**questions/700163/
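
As a rough way to try the suggestion, here is a Python sketch. One caveat: in Python's `re` (and most other flavors), `\b` inside a character class means a backspace character rather than a word boundary. The sketch therefore expresses the stated intent ("prefixed by a dot and suffixed by either a slash or some word boundary") with `\w` before the dot and `(?:/|\b)` after the TLD. The names `URL_HINT` and `looks_like_url` are made up for this example.

```python
import re

# Generic TLDs as listed in the answer; the list has grown a great deal since.
GENERIC_TLDS = (
    "aero|asia|biz|cat|com|coop|edu|gov|info|int|jobs|mil|mobi|"
    "museum|name|net|org|pro|tel|travel"
)

# A word character, a dot, a two-letter ccTLD or a generic TLD,
# then either a slash or a word boundary.
URL_HINT = re.compile(r"\w\.(?:[a-zA-Z]{2}|" + GENERIC_TLDS + r")(?:/|\b)")


def looks_like_url(text):
    """Return True if the text contains something shaped like 'x.tld'."""
    return URL_HINT.search(text) is not None


samples = [
    "buyfunkypharmaceuticals.it",
    "google.com",
    "http://stackoverflow.com/questions/700163/",
    "I tried again. it doesn't work",
]
for sample in samples:
    print(sample, "->", looks_like_url(sample))
```

The first three samples are flagged and the last is not. Because any two letters count as a ccTLD, a file name such as main.py would also be flagged, which fits the trigger-happy stance taken in the question.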

It will of course break as soon as people start obfuscating their URLs, replacing "." with " dot ". But, again assuming spammers are your goal here, if they start doing that sort of thing, their click-through rates are going to drop another couple of orders of magnitude toward zero. The set of people informed enough to deobfuscate a URL and the set of people uninformed enough to visit spam sites have, I think, a minuscule intersection. This solution should let you detect all URLs that are copy-and-pasteable to the address bar, whilst keeping collateral damage to a bare minimum.
