Simple web crawler in C#


Question

I have created a simple web crawler, but I want to add recursion so that for every page that is opened I can get the URLs on that page. I have no idea how to do that, and I also want to include threads to make it faster. Here is my code:

using System;
using System.IO;
using System.Net;
using System.Windows.Forms;
using mshtml; // COM reference "Microsoft HTML Object Library" for HTMLDocument

namespace Crawler
{
    public partial class Form1 : Form
    {
        String Rstring;

        public Form1()
        {
            InitializeComponent();
        }

        private void button1_Click(object sender, EventArgs e)
        {
            String URL = textBox1.Text;

            WebRequest myWebRequest = WebRequest.Create(URL);
            WebResponse myWebResponse = myWebRequest.GetResponse(); // returns a response from the Internet resource

            Stream streamResponse = myWebResponse.GetResponseStream(); // the raw data stream of the response

            StreamReader sreader = new StreamReader(streamResponse); // reads the data stream
            Rstring = sreader.ReadToEnd();      // reads it to the end
            String Links = GetContent(Rstring); // gets the links only

            textBox2.Text = Rstring;
            textBox3.Text = Links;
            streamResponse.Close();
            sreader.Close();
            myWebResponse.Close();
        }

        private String GetContent(String Rstring)
        {
            String sString = "";
            HTMLDocument d = new HTMLDocument();
            IHTMLDocument2 doc = (IHTMLDocument2)d;
            doc.write(Rstring);

            IHTMLElementCollection L = doc.links;

            foreach (IHTMLElement link in L)
            {
                sString += link.getAttribute("href", 0);
                sString += "\n"; // was "/n", which is not a newline escape
            }
            return sString;
        }
    }
}


Answer

I fixed your GetContent method as follows to get new links from a crawled page:

// Requires: using System.Text.RegularExpressions; using System.Collections.Generic;
public ISet<string> GetNewLinks(string content)
{
    Regex regexLink = new Regex("(?<=<a\\s*?href=(?:'|\"))[^'\"]*?(?=(?:'|\"))");

    ISet<string> newLinks = new HashSet<string>();
    foreach (Match match in regexLink.Matches(content))
    {
        newLinks.Add(match.Value); // HashSet<T>.Add already ignores duplicates
    }

    return newLinks;
}
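
For reference, a minimal usage sketch (my addition, not part of the original answer): in the question's button1_Click handler, GetNewLinks can replace the GetContent call, with the results joined for display:

    // Hypothetical wiring inside button1_Click; Rstring and textBox3
    // come from the question's code.
    ISet<string> links = GetNewLinks(Rstring);
    textBox3.Text = string.Join(Environment.NewLine, links);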

Updated

Fixed: the regex variable should be regexLink. Thanks @shashlearner for pointing this out (my typo).
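
The question also asked for recursion and threading, which the snippet above does not cover by itself. Below is a minimal sketch of one way to drive GetNewLinks recursively with a depth limit and parallel downloads; the HttpClient, the ConcurrentDictionary visited set, and the depth parameter are assumptions added for illustration, not part of the original answer:

    using System;
    using System.Collections.Concurrent;
    using System.Collections.Generic;
    using System.Net.Http;
    using System.Text.RegularExpressions;
    using System.Threading.Tasks;

    public class RecursiveCrawler
    {
        private readonly HttpClient client = new HttpClient();
        // Thread-safe "visited" set so parallel branches never fetch the same URL twice.
        private readonly ConcurrentDictionary<string, bool> visited =
            new ConcurrentDictionary<string, bool>();

        public async Task CrawlAsync(string url, int depth)
        {
            // Stop at the depth limit, and skip URLs another branch already claimed.
            if (depth <= 0 || !visited.TryAdd(url, true))
                return;

            string content;
            try
            {
                content = await client.GetStringAsync(url);
            }
            catch (HttpRequestException)
            {
                return; // skip pages that fail to download
            }

            // Extract links, resolve relative URLs against the current page,
            // and crawl the children in parallel.
            var children = new List<Task>();
            foreach (string href in GetNewLinks(content))
            {
                if (Uri.TryCreate(new Uri(url), href, out Uri absolute))
                    children.Add(CrawlAsync(absolute.ToString(), depth - 1));
            }
            await Task.WhenAll(children);
        }

        // GetNewLinks is the method from the answer above, repeated here so the
        // class is self-contained.
        public ISet<string> GetNewLinks(string content)
        {
            Regex regexLink = new Regex("(?<=<a\\s*?href=(?:'|\"))[^'\"]*?(?=(?:'|\"))");
            ISet<string> newLinks = new HashSet<string>();
            foreach (Match match in regexLink.Matches(content))
                newLinks.Add(match.Value);
            return newLinks;
        }
    }

Calling await new RecursiveCrawler().CrawlAsync(textBox1.Text, 2) from an async event handler would then crawl two levels deep without blocking the UI thread.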
