Simple web crawler in C#
Question
I have created a simple web crawler, but I want to add recursion so that for every page that is opened I can also get the URLs on that page. I have no idea how to do that. I would also like to use threads to make it faster. Here is my code:
using System;
using System.IO;
using System.Net;
using System.Windows.Forms;
using mshtml; // assumed: the project references Microsoft.mshtml (HTMLDocument, IHTMLDocument2)

namespace Crawler
{
    public partial class Form1 : Form
    {
        String Rstring;

        public Form1()
        {
            InitializeComponent();
        }

        private void button1_Click(object sender, EventArgs e)
        {
            WebRequest myWebRequest;
            WebResponse myWebResponse;
            String URL = textBox1.Text;

            myWebRequest = WebRequest.Create(URL);
            myWebResponse = myWebRequest.GetResponse(); // returns a response from an Internet resource

            Stream streamResponse = myWebResponse.GetResponseStream(); // the response body as a stream
            StreamReader sreader = new StreamReader(streamResponse); // reads the data stream
            Rstring = sreader.ReadToEnd(); // reads it to the end
            String Links = GetContent(Rstring); // gets the links only

            textBox2.Text = Rstring;
            textBox3.Text = Links;

            streamResponse.Close();
            sreader.Close();
            myWebResponse.Close();
        }

        private String GetContent(String Rstring)
        {
            String sString = "";
            HTMLDocument d = new HTMLDocument();
            IHTMLDocument2 doc = (IHTMLDocument2)d;
            doc.write(Rstring);

            IHTMLElementCollection L = doc.links;
            foreach (IHTMLElement links in L)
            {
                sString += links.getAttribute("href", 0);
                sString += Environment.NewLine; // line break between links
            }
            return sString;
        }
    }
}
Recommended answer
I fixed your GetContent method as follows to get new links from the crawled page:
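// Needs: using System.Collections.Generic; using System.Text.RegularExpressions;
public ISet<string> GetNewLinks(string content)
{
    // Matches the quoted href value when href directly follows "<a",
    // e.g. <a href="..."> or <a href='...'>.
    Regex regexLink = new Regex("(?<=<a\\s*?href=(?:'|\"))[^'\"]*?(?=(?:'|\"))");

    ISet<string> newLinks = new HashSet<string>();
    foreach (Match match in regexLink.Matches(content))
    {
        // HashSet.Add ignores values that are already present.
        newLinks.Add(match.Value);
    }
    return newLinks;
}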
Update
Fixed: regex should be regexLink. Thanks @shashlearner for pointing this out (my mistype).
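The question also asked how to follow the links on each opened page. One possible way to build on GetNewLinks is sketched below. It is only a sketch under assumptions, not part of the original answer: the class name CrawlSketch, the Crawl method, the fetch delegate, and the maxPages limit are all hypothetical names introduced for illustration. It uses a queue instead of literal recursion, so a long chain of links cannot grow the call stack, and a set of visited URLs so that no page is fetched twice.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CrawlSketch
{
    // Same pattern as in the answer above.
    static readonly Regex RegexLink =
        new Regex("(?<=<a\\s*?href=(?:'|\"))[^'\"]*?(?=(?:'|\"))");

    public static ISet<string> GetNewLinks(string content)
    {
        ISet<string> newLinks = new HashSet<string>();
        foreach (Match match in RegexLink.Matches(content))
            newLinks.Add(match.Value);
        return newLinks;
    }

    // Hypothetical helper: breadth-first crawl starting at startUrl.
    // fetch is assumed to return the HTML of a URL (for example a wrapper
    // around the WebRequest code from the question) and may throw on failure.
    // Returns every URL that was attempted, at most maxPages of them.
    public static ISet<string> Crawl(string startUrl, Func<string, string> fetch, int maxPages)
    {
        var visited = new HashSet<string>();
        var pending = new Queue<string>();
        pending.Enqueue(startUrl);

        while (pending.Count > 0 && visited.Count < maxPages)
        {
            Uri pageUri;
            if (!Uri.TryCreate(pending.Dequeue(), UriKind.Absolute, out pageUri))
                continue; // not an absolute URL, skip it

            string url = pageUri.AbsoluteUri; // normalized form
            if (!visited.Add(url))
                continue; // already crawled

            string html;
            try { html = fetch(url); }
            catch (Exception) { continue; } // skip pages that fail to load

            foreach (string href in GetNewLinks(html))
            {
                // Resolve relative links such as "/about" against the current page.
                Uri absolute;
                if (!Uri.TryCreate(pageUri, href, out absolute))
                    continue;
                // Ignore mailto:, javascript: and other non-web schemes.
                if (absolute.Scheme != Uri.UriSchemeHttp && absolute.Scheme != Uri.UriSchemeHttps)
                    continue;
                if (!visited.Contains(absolute.AbsoluteUri))
                    pending.Enqueue(absolute.AbsoluteUri);
            }
        }
        return visited;
    }
}
```

Because fetch is passed in as a delegate, the loop can be tried against an in-memory dictionary of pages before pointing it at a real site. The maxPages limit is there because following every link without a cap will usually never finish on the open web.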