Detecting a (naughty or nice) URL or link in a text string
Problem description
How can I detect (with regular expressions or heuristics) a web site link in a string of text, such as a comment?
The purpose is to prevent spam. HTML is stripped so I need to detect invitations to copy-and-paste. It should not be economical for a spammer to post links because most users could not successfully get to the page. I would like suggestions, references, or discussion on best-practices.
Some goals:
- The low-hanging fruit like well-formed URLs (http://some-fqdn/some/valid/path.ext)
- URLs but without the http:// prefix (i.e. a valid FQDN + valid HTTP path)
- Any other funny business
Of course, I am blocking spam, but the same process could be used to auto-link text.
Here are some things I'm thinking.
- The content is native-language prose so I can be trigger-happy in detection
- Should I strip out all whitespace first, to catch "www .example.com"? Would common users know to remove the space themselves, or do any browsers "do-what-I-mean" and strip it for you?
- Maybe multiple passes is a better strategy, with scans for:
  - Well-formed URLs
  - All non-whitespace followed by '.' followed by any valid TLD
  - Anything else?
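The multi-pass idea can be sketched roughly as follows. This is only an illustration in Python: the function name `classify`, the pass labels, and the shortened TLD list are assumptions made for the example, not something taken from the linked questions.

```python
import re
from urllib.parse import urlsplit

# Pass 1: anything that starts with an explicit http(s) scheme.
SCHEME_URL = re.compile(r"https?://\S+", re.IGNORECASE)

# Pass 2: non-whitespace, a dot, then a TLD (shortened, illustrative list).
BARE_DOMAIN = re.compile(r"\S+\.(?:[a-z]{2}|com|net|org|info|biz)\b", re.IGNORECASE)


def classify(text):
    """Return a label for the first pass that fires, or None if no pass does."""
    for match in SCHEME_URL.finditer(text):
        try:
            if urlsplit(match.group(0)).netloc:
                return "well-formed-url"
        except ValueError:
            # urlsplit rejects some malformed hosts; fall through to pass 2.
            pass
    if BARE_DOMAIN.search(text):
        return "bare-domain"
    return None
```

Running the cheap, precise pass first keeps the more trigger-happy pass as a fallback, so each pass can be tuned separately.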
I've read these and they are now documented here, so you can just reference the regexes in those questions if you want.
- replace URL with HTML Links javascript
- What is the best regular expression to check if a string is a valid URL
- Getting parts of a URL (Regex)
Wow, there are some very good heuristics listed in here! For me, the best bang-for-the-buck is a synthesis of the following:
- @Jon Bright's technique of detecting TLDs (a good defensive chokepoint)
- For those suspicious strings, replace the dot with a dot-looking character as per @capar
- A good dot-looking character is @Sharkey's subscripted &middot; (i.e. "·"). &middot; is also a word boundary so it's harder to casually copy & paste.
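A rough Python sketch of the dot-replacement step. It assumes the comment text has already been HTML-escaped and will be rendered as HTML; the names `defang` and `DOT_LOOKALIKE` and the shortened TLD list are hypothetical choices for the example.

```python
import re

# A word character, a dot, then something that looks like a TLD
# (shortened, illustrative list) followed by a slash or a word boundary.
SUSPICIOUS = re.compile(
    r"(\w)\.((?:[a-zA-Z]{2}|com|net|org|info|biz)(?:/|\b))",
    re.IGNORECASE,
)

# Subscripted middle dot: looks like a dot but does not paste as one.
DOT_LOOKALIKE = "<sub>&middot;</sub>"


def defang(text):
    """Swap the dot in front of a suspected TLD for the look-alike character."""
    return SUSPICIOUS.sub(
        lambda m: m.group(1) + DOT_LOOKALIKE + m.group(2), text
    )
```

For example, `defang("visit google.com now")` yields `visit google<sub>&middot;</sub>com now`, while ordinary prose such as "I tried again. it doesn't work" is left untouched.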
That should make a spammer's CPM low enough for my needs; the "flag as inappropriate" user feedback should catch anything else. Other solutions listed are also very useful:
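- Stripping all dotted quads (@Sharkey's comment on his own answer)
- @Sporkmonger's requirement for client-side JavaScript that inserts a required hidden field into the form.
- Pinging the URL server-side to establish whether it is a web site. (Perhaps I could run the HTML through SpamAssassin or another Bayesian filter, as per @Nathan.)
- Looking at Chrome's source for its smart address bar to see what clever tricks Google uses
- Calling out to OWASP AntiSamy or other web services to detect spam/malware.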
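As an illustration of the first item, here is a minimal Python sketch that removes dotted quads (IPv4-style addresses). The function name `strip_dotted_quads` and the `[removed]` placeholder are assumptions for the example; the original comment does not specify how the stripping is done.

```python
import re

# Four groups of 1-3 digits separated by dots, not embedded in a longer
# run of digits and dots (so "1.2.3.4.5" is left alone).
DOTTED_QUAD = re.compile(r"(?<![\d.])(?:\d{1,3}\.){3}\d{1,3}(?!\d|\.\d)")


def strip_dotted_quads(text, placeholder="[removed]"):
    """Replace anything that parses as an IPv4 address with a placeholder."""
    def replace(match):
        octets = match.group(0).split(".")
        # Only treat it as an address if every octet is in range.
        if all(int(octet) <= 255 for octet in octets):
            return placeholder
        return match.group(0)

    return DOTTED_QUAD.sub(replace, text)
```

Version strings with three components, such as "1.2.3", do not match, which keeps collateral damage in ordinary prose low.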
Recommended answer
I'm concentrating my answer on trying to avoid spammers. This leads to two sub-assumptions: the people using the system will therefore be actively trying to contravene your check and your goal is only to detect the presence of a URL, not to extract the complete URL. This solution would look different if your goal is something else.
I think your best bet is going to be with the TLD. There are the two-letter ccTLDs and the (currently) comparatively small list of others. These need to be prefixed by a dot and suffixed by either a slash or some word boundary. As others have noted, this isn't going to be perfect. There's no way to get "buyfunkypharmaceuticals . it" without disallowing the legitimate "I tried again. it doesn't work" or similar. All of that said, this would be my suggestion:
\w\.([a-zA-Z]{2}|aero|asia|biz|cat|com|coop|edu|gov|info|int|jobs|mil|mobi|museum|name|net|org|pro|tel|travel)(?:/|\b)
This will get:
- buyfunkypharmaceutical**s.it**
- googl**e.com**
- http://stackoverflo**w.com/**questions/700163/
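A minimal Python sketch of this TLD check, assuming Python's `re` module. In `re`, `\b` inside square brackets means a backspace character rather than a word boundary, so the sketch writes the boundary outside a character class and requires a word character before the dot. The names `TLD_PATTERN` and `find_tld_hits` are hypothetical, and the TLD list is the snapshot given in the answer, so it would need updating.

```python
import re

# Generic TLDs listed in the answer (a snapshot, not a current list).
GENERIC_TLDS = (
    "aero|asia|biz|cat|com|coop|edu|gov|info|int|jobs|mil|mobi|"
    "museum|name|net|org|pro|tel|travel"
)

# One word character, a literal dot, a two-letter ccTLD or a generic TLD,
# then either a slash or a word boundary.
TLD_PATTERN = re.compile(
    r"\w\.(?:[a-zA-Z]{2}|" + GENERIC_TLDS + r")(?:/|\b)",
    re.IGNORECASE,
)


def find_tld_hits(text):
    """Return the matched fragments, e.g. 'e.com' for 'google.com'."""
    return [m.group(0) for m in TLD_PATTERN.finditer(text)]
```

Since the goal is only to detect the presence of a URL, a non-empty result is enough to flag the comment; the fragments are returned mainly to make false positives easy to review.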
It will of course break as soon as people start obfuscating their URLs, replacing "." with " dot ". But, again assuming spammers are your goal here, if they start doing that sort of thing, their click-through rates are going to drop another couple of orders of magnitude toward zero. The set of people informed enough to deobfuscate a URL and the set of people uninformed enough to visit spam sites have, I think, a minuscule intersection. This solution should let you detect all URLs that are copy-and-pasteable to the address bar, whilst keeping collateral damage to a bare minimum.