Extracting URLs using regex in .NET


Question

I took inspiration from the example shown at csharp-online (http://en.csharp-online.net/CSharp_Regular_Expression_Recipes%E2%80%94Extracting_Groups_from_a_MatchCollection) and intend to retrieve all the URLs from this Alexa page (http://www.alexa.com/topsites/category/Top/Society/History/By_Topic/Science/Engineering_and_Technology).

using System;
using System.Collections;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Net;
using System.Text.RegularExpressions;
namespace ExtractingUrls
{
    class Program
    {
        static void Main(string[] args)
        {
            WebClient client = new WebClient();
            const string url = "http://www.alexa.com/topsites/category/Top/Society/History/By_Topic/Science/Engineering_and_Technology";
            string source = client.DownloadString(url);
            //Console.WriteLine(Getvals(source));
            string matchPattern =
                    @"<a.rel=""nofollow"".style=""font-size:0.8em;"".href=[""'](?<url>[^""^']+[.]*)[""'].class=""offsite"".*>(?<name>[^<]+[.]*)</a>";
            foreach (Hashtable grouping in ExtractGroupings(source, matchPattern, true))
            {
                foreach (DictionaryEntry DE in grouping)
                {
                    Console.WriteLine("Value = " + DE.Value);
                    Console.WriteLine("");
                }
            }
            // End.
            Console.ReadLine();
        }
        public static ArrayList ExtractGroupings(string source, string matchPattern, bool wantInitialMatch)
        {
            ArrayList keyedMatches = new ArrayList();
            int startingElement = 1;
            if (wantInitialMatch)
            {
                startingElement = 0;
            }
            Regex RE = new Regex(matchPattern, RegexOptions.Multiline);
            MatchCollection theMatches = RE.Matches(source);
            foreach (Match m in theMatches)
            {
                Hashtable groupings = new Hashtable();
                for (int counter = startingElement; counter < m.Groups.Count; counter++)
                {
                    // If we had just returned the MatchCollection directly, the
                    // GroupNameFromNumber method would not be available to use
                    groupings.Add(RE.GroupNameFromNumber(counter),
                    m.Groups[counter]);
                }
                keyedMatches.Add(groupings);
            }
            return (keyedMatches);
        }
    }
}

But here I face a problem: when I run this, each URL is displayed three times. First the whole anchor tag is displayed, then the URL is displayed twice. Can anyone suggest what I should correct so that each URL is displayed exactly once?

Recommended answer

Your regex defines two groups, and the entire match counts as a third (group 0). If I'm reading it correctly, you only want the URL portion of each match, which is the second of the three groupings. Your loop currently adds all of them:

for (int counter = startingElement; counter < m.Groups.Count; counter++)
{
    // If we had just returned the MatchCollection directly, the
    // GroupNameFromNumber method would not be available to use
    groupings.Add(RE.GroupNameFromNumber(counter),
        m.Groups[counter]);
}

Don't you want this instead?

groupings.Add(RE.GroupNameFromNumber(1),m.Groups[1]);
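As a rough sketch of the same idea, the URL can also be read directly from the named group rather than by looping over every group. The class name UrlExtractor, the method ExtractUrls, the simplified anchor pattern, and the sample markup are all assumptions made for illustration. They are not taken from the original code or from the Alexa page.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class UrlExtractor
{
    // Hypothetical, simplified pattern: the href value goes into the named
    // group "url" and the link text into "name". It is looser than the
    // pattern in the question and is only meant to illustrate group access.
    private static readonly Regex AnchorPattern = new Regex(
        @"<a\s[^>]*?href=[""'](?<url>[^""']+)[""'][^>]*>(?<name>[^<]*)</a>",
        RegexOptions.IgnoreCase);

    // Returns one entry per matched anchor, taken from the "url" group only.
    // Group 0 (the whole anchor tag) and the "name" group are ignored.
    public static List<string> ExtractUrls(string html)
    {
        var urls = new List<string>();
        foreach (Match m in AnchorPattern.Matches(html))
        {
            urls.Add(m.Groups["url"].Value);
        }
        return urls;
    }
}

public static class Demo
{
    public static void Main()
    {
        // Made-up sample markup, loosely shaped like the anchors in the question.
        const string sample =
            "<a rel=\"nofollow\" href=\"http://example.com/\" class=\"offsite\">example.com</a>";

        foreach (string url in UrlExtractor.ExtractUrls(sample))
        {
            Console.WriteLine(url);
        }
    }
}
```

Accessing m.Groups["url"] by name avoids depending on group numbering, so adding or removing other groups in the pattern would not change which value is printed.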
