Regex Help: Get list of URL(s) except extension .css, .js, .jpg, .gif, .png


Problem description

I am having a problem with a regular expression.

I want to get all URLs from a given string, but I don't want URLs that end with .jpg, .css, .js, .gif, etc.

Here is my ASP.NET C# code:

    using (var client = new WebClient())
    {
        client.Headers[HttpRequestHeader.UserAgent] = "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13";
        string result = client.DownloadString(strBasicUrl);

        Regex MyRegex = new Regex("http(s)?://([\\w+?\\.\\w+])+([a-zA-Z0-9\\~\\!\\@\\#\\$\\%\\^\\&\\*\\(\\)_\\-\\=\\+\\\\\\/\\?\\.\\:\\;\\'\\,]*)?", RegexOptions.Multiline | RegexOptions.CultureInvariant | RegexOptions.Compiled);
        MatchCollection matches = MyRegex.Matches(result);
        foreach (var item in matches)
        {
            litResult.Text += item.ToString() + "<br>";
        }
    }

I want to change this regular expression.

If I request strBasicUrl "http://www.Microsoft.com", the result should not include URLs such as these:
http://i.microsoft.com/en-us/homepage/shared/templates/components/hpSearch/images/searchSprite.ltr.gif
http://i.microsoft.com/global/ImageStore/PublishingImages/Asset/Header/logo_skype.png

Can anybody help me with this? It would be much appreciated.

Thanks in advance, Amit Prajapati

Solution

I think Mike has already answered your question, but I have been thinking about this ever since you asked, and thanks to your question I learnt about lookaheads, lookbehinds, and negative lookbehinds in regular expressions.

So here is one alternative, in case you don't want to run a regular expression inside a loop.

public Regex MyRegex = new Regex(
  "href=\"(?<URL>(?:(?!javascript)(?!#))[a-zA-Z0-9\\~\\!\\@\\#\\$"+
  "\\%\\^\\&\\*\\(\\)_\\-\\=\\+\\\\\\/\\?\\.\\:\\;\\'\\,]+)"+
  "(?<!(?:\\.png|\\.js|\\.jpg|\\.jpeg|\\.css|\\.gif|\\.zip|\\.r"+
  "ar))\"(?:$|>|\\s)",
RegexOptions.Multiline
| RegexOptions.CultureInvariant
| RegexOptions.Compiled
);

For readability, here is the regex without the C# string escaping:

href="(?<URL>(?:(?!javascript)(?!#))[a-zA-Z0-9\~\!\@\#\$\%\^\&\*\(\)_\-\=\+\\\/\?\.\:\;\'\,]+)(?<!(?:\.png|\.js|\.jpg|\.jpeg|\.css|\.gif|\.zip|\.rar))"(?:$|>|\s)

Assuming you are developing a crawler: your regex does not match relative links, and once relative links are matched, you should not match links that start with javascript or # (anchors).

As you can see, we capture into a named group called "URL". So to get the URL part you need to use (as you may already know):

match.Groups["URL"]
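
As a minimal sketch of how the pattern and the named group might be used together, the snippet below runs the same regex (written as a C# verbatim string) over a piece of HTML and collects `Groups["URL"].Value` from each match. The class name `LinkExtractor` and the method `Extract` are illustrative names, not part of the original answer.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LinkExtractor
{
    // Same pattern as above; in a verbatim string, "" stands for one double quote.
    private static readonly Regex LinkRegex = new Regex(
        @"href=""(?<URL>(?:(?!javascript)(?!#))[a-zA-Z0-9\~\!\@\#\$\%\^\&\*\(\)_\-\=\+\\\/\?\.\:\;\'\,]+)" +
        @"(?<!(?:\.png|\.js|\.jpg|\.jpeg|\.css|\.gif|\.zip|\.rar))""(?:$|>|\s)",
        RegexOptions.Multiline | RegexOptions.CultureInvariant);

    // Returns the value of the "URL" group for every href that passes the filters.
    public static List<string> Extract(string html)
    {
        var urls = new List<string>();
        foreach (Match match in LinkRegex.Matches(html))
        {
            urls.Add(match.Groups["URL"].Value);
        }
        return urls;
    }
}
```

For example, `LinkExtractor.Extract("<a href=\"/products/index.html\">P</a> <link href=\"/styles/site.css\" rel=\"stylesheet\">")` should return only `/products/index.html`. Note that the extension check is case-sensitive as written, so an href ending in `.PNG` would still be returned unless `RegexOptions.IgnoreCase` is added.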

Here is the explanation of the regex:

///      href="
///  [URL]: A named capture group. [(?:(?!javascript)(?!#))[a-zA-Z0-9\~\!\@\#\$\%\^\&\*\(\)_\-\=\+\\\/\?\.\:\;\'\,]+]
///      (?:(?!javascript)(?!#))[a-zA-Z0-9\~\!\@\#\$\%\^\&\*\(\)_\-\=\+\\\/\?\.\:\;\'\,]+
///          Match expression but don't capture it. [(?!javascript)(?!#)]
///              (?!javascript)(?!#)
///                  Match if suffix is absent. [javascript]
///                      javascript
///                          javascript
///                  Match if suffix is absent. [#]
///                      #
///          Any character in this class: [a-zA-Z0-9\~\!\@\#\$\%\^\&\*\(\)_\-\=\+\\\/\?\.\:\;\'\,], one or more repetitions
///  Match if prefix is absent. [(?:\.png|\.js|\.jpg|\.jpeg|\.css|\.gif|\.zip|\.rar)]
///      Match expression but don't capture it. [\.png|\.js|\.jpg|\.jpeg|\.css|\.gif|\.zip|\.rar]
///          Select from 8 alternatives
///              \.png
///                  Literal .
///                  png
///              \.js
///                  Literal .
///                  js
///              \.jpg
///                  Literal .
///                  jpg
///              \.jpeg
///                  Literal .
///                  jpeg
///              \.css
///                  Literal .
///                  css
///              \.gif
///                  Literal .
///                  gif
///              \.zip
///                  Literal .
///                  zip
///              \.rar
///                  Literal .
///                  rar
///  "
///  Match expression but don't capture it. [$|>|\s]
///      Select from 3 alternatives
///          End of line or string
///          >
///          Whitespace
///  
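
In the breakdown above, "Match if suffix is absent" corresponds to a negative lookahead `(?!...)` and "Match if prefix is absent" to a negative lookbehind `(?<!...)`. The following sketch isolates the two constructs on a bare href value, using deliberately simplified patterns; `LookaroundDemo` and `IsCrawlable` are hypothetical names chosen for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookaroundDemo
{
    // Negative lookahead: succeeds only if the text does NOT start with "javascript" or "#".
    private static readonly Regex NotScriptOrAnchor = new Regex(@"^(?!javascript|#).+$");

    // Negative lookbehind: succeeds only if the text does NOT end with ".png" or ".gif".
    private static readonly Regex NotImage = new Regex(@"^.+(?<!\.png|\.gif)$");

    // True when the href passes both checks.
    public static bool IsCrawlable(string href)
    {
        return NotScriptOrAnchor.IsMatch(href) && NotImage.IsMatch(href);
    }
}
```

With these patterns, `IsCrawlable("/about/index.html")` is true, while `IsCrawlable("#top")`, `IsCrawlable("javascript:void(0)")`, and `IsCrawlable("/img/logo.png")` are all false.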

This way you don't need to run a second regular expression in the loop, and you will get both absolute and relative URLs.
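
Since the captured values can be relative, a crawler will usually need to resolve them against the page address before requesting them. One possible way to do that, sketched here with `System.Uri`, is shown below; the helper name `UrlResolver.Resolve` is an assumption for illustration and is not part of the original answer.

```csharp
using System;

public static class UrlResolver
{
    // Resolves href against baseUrl. An href that is already absolute is returned as-is.
    // Returns null when the combination cannot be parsed as a URI.
    public static string Resolve(string baseUrl, string href)
    {
        var baseUri = new Uri(baseUrl);
        Uri resolved;
        if (Uri.TryCreate(baseUri, href, out resolved))
        {
            return resolved.AbsoluteUri;
        }
        return null;
    }
}
```

For instance, `UrlResolver.Resolve("http://www.microsoft.com", "/en-us/default.aspx")` gives `http://www.microsoft.com/en-us/default.aspx`.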

Hope it helps...
