Regular Expression to find all links in webpage


Problem description




I am trying to find all of the links in the source code of a website. Could anyone tell me the expression I would need to put in my Regex to find them?


Duplicate of (among others): Regular expression for parsing links from a webpage?

Google finds more: html links regex site:stackoverflow.com

Solution

I'm not certain how these would translate to C# (I haven't done any development in C# myself yet), but here's how I might do it in JavaScript or ColdFusion. It might give you an idea about how you want to do it in C#.

In JavaScript I think this would work:

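// [\s\S]*? lazily skips everything (newlines included) up to each href;
// the second alternative swallows whatever follows the last link.
rex = /[\s\S]*?href="([^"]+)"|[\s\S]+/g;
a = source.replace(rex, '\n$1').split('\n').filter(Boolean); // filter drops empty entries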

after which a would be an array containing the links... though I'm not certain if that will work exactly the way I think it will. The idea here is that the replace creates a line-break-delimited list (because you can't have a line-break in a URL) and then you can break apart the list with split() to get your array.
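As a point of comparison with the replace-and-split idea, here is a minimal sketch that reads the capture group directly with `matchAll`. The helper name `extractLinks` and the sample markup are made up for illustration, and the pattern still assumes double-quoted `href` values, so treat it as a starting point rather than a complete link parser.

```javascript
// Hypothetical helper (not from the original answer): collect every
// double-quoted href value found in an HTML string.
function extractLinks(source) {
  const rex = /href="([^"]+)"/g; // matchAll requires the g flag
  // Each match object holds the full match at [0] and the captured URL at [1].
  return Array.from(source.matchAll(rex), (m) => m[1]);
}

// Illustrative input only.
const sample = '<a href="/a">A</a>\n<p>text</p><a href="https://example.com/b">B</a>';
console.log(extractLinks(sample));
```

Because the URLs come straight from the capture group, no delimiter character is needed and an input without links simply yields an empty array.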

By comparison in ColdFusion you would have to do something slightly different:

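a = REMatch('href="[^"]+"', source);
// CFML arrays are 1-based, so <= is needed to reach the last element
for (i = 1; i <= ArrayLen(a); i++) {
  // skip the 6 characters of href=" and drop the closing quote
  a[i] = mid(a[i], 7, len(a[i]) - 7);
}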

Again, I haven't tested it, but REMatch returns an array of every match of the expression, and the for loop then strips the surrounding href="" from each match, leaving the actual URL.
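The same match-then-trim approach can be sketched in JavaScript for comparison. The function name `extractLinksByTrimming` and the sample string are hypothetical, and the pattern again assumes double-quoted attributes.

```javascript
// Hypothetical helper mirroring the ColdFusion idea: grab whole href="..."
// matches first, then cut the wrapper off each one.
function extractLinksByTrimming(source) {
  // With the g flag, match returns all full matches, or null if there are none.
  const matches = source.match(/href="[^"]+"/g) || [];
  // href=" is 6 characters long; slice(6, -1) also drops the closing quote.
  return matches.map((m) => m.slice(6, -1));
}

// Illustrative input only.
const page = '<a href="one.html">1</a> <a href="two.html">2</a>';
console.log(extractLinksByTrimming(page));
```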
