How to match URIs in text?

Question

How would one go about spotting URIs in a block of text?

The idea is to turn such runs of text into links. This is pretty simple if one considers only the http(s) and ftp(s) schemes; however, I am guessing the general problem (considering tel, mailto and other URI schemes) is much more complicated, if it is even possible.

I would prefer a solution in C# if possible. Thank you.

Recommended answer

Regexes may prove a good starting point for this, though URIs and URLs are notoriously difficult to match with a single pattern.

To illustrate, the simplest of patterns looks fairly complicated (in Perl 5 notation):

\w+:\/{2}[\d\w-]+(\.[\d\w-]+)*(?:(?:\/[^\s/]*))*

This would match, for example:

http://example.com/foo/bar-baz

ftp://192.168.0.1/foo/file.txt

but would cause problems for at least these:

  • mailto:support@stackoverflow.com (no match: there is no //, but there is an @)
  • ftp://192.168.0.1.2 (match, but too many numbers, so the host is not a valid IP address)
  • ftp://1000.120.0.1 (match, but each part of an IP address must be between 0 and 255, so the host is not a valid IP address)
  • nonexistantscheme://obvious.false.positive
  • http://www.google.com/search?q=uri+regular+expression (match, but query isn't ...)

I think this is a case of the 80:20 rule. If you want to catch most things, then I would do as suggested and find a decent regular expression if you can't write one yourself.
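As a quick check of the examples listed above, the pattern can be run against each one. This is only a minimal sketch, written in Python rather than C# purely for illustration; the names `SIMPLE_URI`, `samples` and `matches_whole` are hypothetical, and `fullmatch` is used so that a partial match inside a string does not count as a match.

```python
import re

# The pattern from the answer, unchanged apart from Python raw-string quoting.
SIMPLE_URI = re.compile(r"\w+:\/{2}[\d\w-]+(\.[\d\w-]+)*(?:(?:\/[^\s/]*))*")

# The example strings discussed in the answer.
samples = [
    "http://example.com/foo/bar-baz",
    "ftp://192.168.0.1/foo/file.txt",
    "mailto:support@stackoverflow.com",
    "ftp://192.168.0.1.2",
    "ftp://1000.120.0.1",
    "nonexistantscheme://obvious.false.positive",
    "http://www.google.com/search?q=uri+regular+expression",
]


def matches_whole(text: str) -> bool:
    """Return True when the pattern matches the entire string."""
    return SIMPLE_URI.fullmatch(text) is not None


for sample in samples:
    print(f"{sample}: {matches_whole(sample)}")
```

Only the mailto example is rejected. The two malformed IP addresses and the made-up scheme are all accepted, which is the kind of false positive the list describes.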

If you're looking at text pulled from fairly controlled sources (e.g. machine-generated text), then this will be the best course of action.

If you absolutely, positively have to catch every URI that you encounter, and you're looking at text from the wild, then I think I would look for any word with a colon in it, e.g. \s(\w+:\S+)\s. Once you have a suitable candidate for a URI, pass it to a real URI parser, such as the URI class of whatever library you're using.
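The two-step approach described above (find loose candidates, then hand each one to a parser) might be sketched as follows. This is Python rather than C#, for illustration only; in C#, `Uri.TryCreate` would presumably play the parser's role. Several things here are assumptions rather than part of the answer: the candidate pattern uses `\w+` before the colon and lookarounds instead of literal `\s` so that candidates at the very start or end of the text are found, the `KNOWN_SCHEMES` allowlist is hypothetical, and trailing punctuation is stripped by a simple heuristic. Note also that `urllib.parse.urlsplit` is lenient and does not fully validate a URI, which is why the scheme check is added.

```python
import re
from urllib.parse import urlsplit

# Loose candidate pattern: word characters, a colon, then non-space text.
# (?<!\S) and (?!\S) require whitespace or a text boundary on either side.
CANDIDATE = re.compile(r"(?<!\S)(\w+:\S+)(?!\S)")

# Hypothetical allowlist; adjust to the schemes you care about.
KNOWN_SCHEMES = {"http", "https", "ftp", "ftps", "mailto", "tel"}


def find_uris(text: str) -> list[str]:
    """Return candidates that parse with a known scheme and a non-empty body."""
    found = []
    for match in CANDIDATE.finditer(text):
        # Heuristic: drop sentence punctuation stuck to the end of the word.
        candidate = match.group(1).rstrip(".,;:!?)")
        try:
            parts = urlsplit(candidate)
        except ValueError:
            # urlsplit rejects some malformed input, e.g. an unclosed "[".
            continue
        if parts.scheme.lower() in KNOWN_SCHEMES and (parts.netloc or parts.path):
            found.append(candidate)
    return found


text = (
    "Mail mailto:support@stackoverflow.com or see http://example.com/foo, "
    "then call tel:+1-555-0100. Note: this is not one."
)
print(find_uris(text))
```

The allowlist trades completeness for precision: unknown schemes such as nonexistantscheme are dropped, so any scheme you want linked has to be listed explicitly.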

If you're interested in why it's so hard to write a URI pattern, then I guess it would be that the definition of a URI is done with a Type-2 (context-free) grammar, while regular expressions can only parse languages from Type-3 (regular) grammars.
