Detecting a (naughty or nice) URL or link in a text string

Views: 244 · Published: 2020/4/27 3:38:23 · Tags: language-agnostic url sanitization spam-prevention

Question

How can I detect (with regular expressions or heuristics) a web site link in a string of text, such as a comment?

The purpose is to prevent spam. HTML is stripped, so I need to detect invitations to copy and paste. It should not be economical for a spammer to post links, because most users could not successfully get to the page. I would like suggestions, references, or discussion of best practices.

Some goals:

- The low-hanging fruit, like well-formed URLs (http://some-fqdn/some/valid/path.ext)
- URLs without the http:// prefix (i.e. a valid FQDN + valid HTTP path)
- Any other funny business

Of course, I am blocking spam, but the same process could be used to auto-link text.

Here are some things I'm thinking.

- The content is native-language prose, so I can be trigger-happy in detection
- Should I strip out all whitespace first, to catch "www .example.com"? Would common users know to remove the space themselves, or do any browsers "do-what-I-mean" and strip it for you?
- Maybe multiple passes is a better strategy, with scans for:
  - Well-formed URLs
  - All non-whitespace followed by '.' followed by any valid TLD
  - Anything else?
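
The multi-pass idea above could be sketched roughly as follows. This is only an illustration, not a vetted filter: the names `TLDS`, `PASSES`, and `scan` are made up for the example, the TLD list is deliberately short, and the third pass is one guess at how to handle the "www .example.com" case without stripping all whitespace.

```python
import re

# Hypothetical, deliberately short TLD list; a real filter would use a full list.
TLDS = ("com", "net", "org", "info", "biz")

# Pass 1: well-formed URLs with an explicit http:// or https:// scheme.
WELL_FORMED = re.compile(r"https?://[^\s/]+(?:/\S*)?", re.IGNORECASE)

# Pass 2: a run of non-whitespace, a dot, then a known TLD at a word boundary.
BARE_DOMAIN = re.compile(r"\S+\.(?:" + "|".join(TLDS) + r")\b", re.IGNORECASE)

# Pass 3: "www" followed by a dot, allowing whitespace on either side of the dot.
SPACED_DOT = re.compile(r"\bwww\s*\.\s*\S+", re.IGNORECASE)

PASSES = [
    ("well-formed", WELL_FORMED),
    ("bare-domain", BARE_DOMAIN),
    ("spaced-dot", SPACED_DOT),
]


def scan(text):
    """Return the name of every pass that fires on the text."""
    return [name for name, pattern in PASSES if pattern.search(text)]
```

Keeping the passes separate makes it easy to see which heuristic fired, and to tune or drop one without touching the others.
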

I've read these and they are now documented here, so you can just reference the regexes in those questions if you want.

- replace URL with HTML Links javascript
- What is the best regular expression to check if a string is a valid URL
- Getting parts of a URL (Regex)

Wow, there are some very good heuristics listed in here! For me, the best bang for the buck is a synthesis of the following:

1. @Jon Bright's technique of detecting TLDs (a good defensive chokepoint)
2. For those suspicious strings, replace the dot with a dot-looking character, as per @capar
3. A good dot-looking character is @Sharkey's subscripted · (i.e. "_·"). The · is also a word boundary, so it's harder to casually copy & paste.
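
Steps 1 and 2 might be combined along these lines. This is a minimal sketch under stated assumptions: `TLDS`, `SUSPICIOUS_DOT`, and `defang` are hypothetical names, the TLD list is shortened, and the replacement is a plain middle dot (U+00B7) with no subscript markup, since how to render the subscript depends on the output format.

```python
import re

# Hypothetical, deliberately short TLD list; a real filter would use a full list.
TLDS = ("com", "net", "org", "info", "biz")

# A dot that sits between a non-space character and a known TLD.
# Lookarounds keep the neighbouring characters out of the replacement.
SUSPICIOUS_DOT = re.compile(
    r"(?<=\S)\.(?=(?:" + "|".join(TLDS) + r")\b)", re.IGNORECASE
)

MIDDLE_DOT = "\u00b7"  # the "·" character


def defang(text):
    """Replace only the dots that look like they precede a TLD."""
    return SUSPICIOUS_DOT.sub(MIDDLE_DOT, text)
```

Ordinary sentence-ending dots are left alone, because they are followed by whitespace or the end of the string rather than by a TLD.
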

That should make a spammer's CPM low enough for my needs; the "flag as inappropriate" user feedback should catch anything else. Other solutions listed are also very useful:

- Stripping out all dotted quads (@Sharkey's comment on his own answer)
- @Sporkmonger's requirement for client-side JavaScript that inserts a required hidden field into the form
- Pinging the URL server-side to establish whether it is a web site. (Perhaps I could run the HTML through SpamAssassin or another Bayesian filter, as per @Nathan.)
- Looking at Chrome's source for its smart address bar to see what clever tricks Google uses
- Calling out to OWASP AntiSamy or other web services for spam/malware detection
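
As an illustration of the first item, a dotted-quad filter could look something like this. It is a rough sketch: `DOTTED_QUAD` and `strip_dotted_quads` are hypothetical names, the replacement text is arbitrary, and the pattern does not check that each group is in the 0-255 range, which seems acceptable when being trigger-happy.

```python
import re

# Four groups of 1-3 digits separated by dots, not embedded in a longer
# dotted number such as "1.2.3.4.5". A trailing sentence dot is tolerated.
DOTTED_QUAD = re.compile(r"(?<![\d.])(?:\d{1,3}\.){3}\d{1,3}(?!\d|\.\d)")


def strip_dotted_quads(text, replacement="[removed]"):
    """Replace anything that looks like an IPv4 address."""
    return DOTTED_QUAD.sub(replacement, text)
```

Three-part version numbers such as "1.2.3" are left untouched, since the pattern needs exactly four groups.
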

Recommended answer
  
I'm concentrating my answer on trying to avoid spammers. This leads to two sub-assumptions: the people using the system will be actively trying to get around your check, and your goal is only to detect the presence of a URL, not to extract the complete URL. This solution would look different if your goal were something else.
  
I think your best bet is going to be with the TLD. There are the two-letter ccTLDs and the (currently) comparatively small list of others. These need to be prefixed by a dot and suffixed by either a slash or some word boundary. As others have noted, this isn't going to be perfect. There's no way to catch "buyfunkypharmaceuticals . it" without disallowing the legitimate "I tried again. it doesn't work" or similar. All of that said, this would be my suggestion:
```
\S\.([a-zA-Z]{2}|aero|asia|biz|cat|com|coop|edu|gov|info|int|jobs|mil|mobi|museum|name|net|org|pro|tel|travel)(/|\b)
```
This will get (matched portion in bold):

- buyfunkypharmaceutical**s.it**
- googl**e.com**
- http://stackoverflo**w.com/**questions/700163/
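
For anyone who wants to try the TLD check, here is a hedged Python rendering of it. In Python's `re` module, `\b` inside square brackets means a backspace character rather than a word boundary, so this sketch spells the prefix as `\S` and the suffix as `(?:/|\b)`. The names `GENERIC_TLDS`, `URL_HINT`, and `find_url_hint` are invented for the example, and the case-insensitive flag is an addition.

```python
import re

# Generic TLDs from the answer; two-letter ccTLDs are handled by [a-z]{2}.
GENERIC_TLDS = (
    "aero|asia|biz|cat|com|coop|edu|gov|info|int|jobs|mil|mobi|museum|"
    "name|net|org|pro|tel|travel"
)

# One non-space character, a dot, a TLD, then either a slash or a word boundary.
# The slash comes first in the alternation so that it is part of the match.
URL_HINT = re.compile(
    r"\S\.(?:[a-z]{2}|" + GENERIC_TLDS + r")(?:/|\b)", re.IGNORECASE
)


def find_url_hint(text):
    """Return the first matched fragment, or None if nothing looks like a URL."""
    match = URL_HINT.search(text)
    return match.group(0) if match else None


if __name__ == "__main__":
    for sample in (
        "buyfunkypharmaceuticals.it",
        "http://stackoverflow.com/questions/700163/",
        "I tried again. it doesn't work",
    ):
        print(repr(sample), "->", find_url_hint(sample))
```

Since the goal is only to detect the presence of a URL, the function stops at the first hit instead of collecting every match.
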

It will of course break as soon as people start obfuscating their URLs, replacing "." with " dot ". But, again assuming spammers are your goal here, if they start doing that sort of thing, their click-through rates are going to drop another couple of orders of magnitude toward zero. The set of people informed enough to deobfuscate a URL and the set of people uninformed enough to visit spam sites have, I think, a minuscule intersection. This solution should let you detect all URLs that are copy-and-pasteable to the address bar, whilst keeping collateral damage to a bare minimum.
  