Simple web crawler in C#


Question

I have created a simple web crawler, but I want to add recursion so that for every page that is opened I can get the URLs on that page. I have no idea how to do that, and I also want to include threads to make it faster. Here is my code:

using System;
using System.IO;
using System.Net;
using System.Windows.Forms;
using mshtml; // COM reference "Microsoft HTML Object Library" for HTMLDocument

namespace Crawler
{
    public partial class Form1 : Form
    {
        String Rstring;

        public Form1()
        {
            InitializeComponent();
        }

        private void button1_Click(object sender, EventArgs e)
        {
            String URL = textBox1.Text;

            WebRequest myWebRequest = WebRequest.Create(URL);
            WebResponse myWebResponse = myWebRequest.GetResponse(); // returns a response from the Internet resource

            Stream streamResponse = myWebResponse.GetResponseStream(); // the raw data stream of the response

            StreamReader sreader = new StreamReader(streamResponse); // reads the data stream
            Rstring = sreader.ReadToEnd();      // reads it to the end
            String Links = GetContent(Rstring); // gets the links only

            textBox2.Text = Rstring;
            textBox3.Text = Links;
            streamResponse.Close();
            sreader.Close();
            myWebResponse.Close();
        }

        private String GetContent(String Rstring)
        {
            String sString = "";
            HTMLDocument d = new HTMLDocument();
            IHTMLDocument2 doc = (IHTMLDocument2)d;
            doc.write(Rstring);

            IHTMLElementCollection L = doc.links;

            foreach (IHTMLElement link in L)
            {
                sString += link.getAttribute("href", 0);
                sString += "\n"; // was "/n", which is not a newline escape
            }
            return sString;
        }
    }
}


Answer

I fixed your GetContent method as follows to get new links from a crawled page:

// Requires: using System.Text.RegularExpressions; using System.Collections.Generic;
public ISet<string> GetNewLinks(string content)
{
    Regex regexLink = new Regex("(?<=<a\\s*?href=(?:'|\"))[^'\"]*?(?=(?:'|\"))");

    ISet<string> newLinks = new HashSet<string>();
    foreach (Match match in regexLink.Matches(content))
    {
        newLinks.Add(match.Value); // HashSet<T>.Add already ignores duplicates
    }

    return newLinks;
}
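
For reference, a minimal usage sketch (my addition, not part of the original answer): in the question's button1_Click handler, GetNewLinks can replace the GetContent call, with the results joined for display:

    // Hypothetical wiring inside button1_Click; Rstring and textBox3
    // come from the question's code.
    ISet<string> links = GetNewLinks(Rstring);
    textBox3.Text = string.Join(Environment.NewLine, links);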

Updated

Fixed: the regex variable should be regexLink. Thanks @shashlearner for pointing this out (my typo).
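
The question also asked for recursion and threading, which the snippet above does not cover by itself. Below is a minimal sketch of one way to drive GetNewLinks recursively with a depth limit and parallel downloads; the HttpClient, the ConcurrentDictionary visited set, and the depth parameter are assumptions added for illustration, not part of the original answer:

    using System;
    using System.Collections.Concurrent;
    using System.Collections.Generic;
    using System.Net.Http;
    using System.Text.RegularExpressions;
    using System.Threading.Tasks;

    public class RecursiveCrawler
    {
        private readonly HttpClient client = new HttpClient();
        // Thread-safe "visited" set so parallel branches never fetch the same URL twice.
        private readonly ConcurrentDictionary<string, bool> visited =
            new ConcurrentDictionary<string, bool>();

        public async Task CrawlAsync(string url, int depth)
        {
            // Stop at the depth limit, and skip URLs another branch already claimed.
            if (depth <= 0 || !visited.TryAdd(url, true))
                return;

            string content;
            try
            {
                content = await client.GetStringAsync(url);
            }
            catch (HttpRequestException)
            {
                return; // skip pages that fail to download
            }

            // Extract links, resolve relative URLs against the current page,
            // and crawl the children in parallel.
            var children = new List<Task>();
            foreach (string href in GetNewLinks(content))
            {
                if (Uri.TryCreate(new Uri(url), href, out Uri absolute))
                    children.Add(CrawlAsync(absolute.ToString(), depth - 1));
            }
            await Task.WhenAll(children);
        }

        // GetNewLinks is the method from the answer above, repeated here so the
        // class is self-contained.
        public ISet<string> GetNewLinks(string content)
        {
            Regex regexLink = new Regex("(?<=<a\\s*?href=(?:'|\"))[^'\"]*?(?=(?:'|\"))");
            ISet<string> newLinks = new HashSet<string>();
            foreach (Match match in regexLink.Matches(content))
                newLinks.Add(match.Value);
            return newLinks;
        }
    }

Calling await new RecursiveCrawler().CrawlAsync(textBox1.Text, 2) from an async event handler would then crawl two levels deep without blocking the UI thread.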
