How would I download all kinds of file types from a website?
Question
I have the following code in a new class:
using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
using System.Drawing;
using System.Linq;
using System.Text;
using System.Windows.Forms;
using HtmlAgilityPack;
using System.IO;
using System.Text.RegularExpressions;
using System.Xml.Linq;
using System.Net;
using System.Web;
using System.Threading;
using DannyGeneral;
using GatherLinks;

namespace GatherLinks
{
    class RetrieveWebContent
    {
        HtmlAgilityPack.HtmlDocument doc;
        string imgg;
        int images;

        public RetrieveWebContent()
        {
            images = 0;
        }

        public List<string> retrieveImages(string address)
        {
            try
            {
                doc = new HtmlAgilityPack.HtmlDocument();
                System.Net.WebClient wc = new System.Net.WebClient();
                List<string> imgList = new List<string>();
                doc.Load(wc.OpenRead(address));
                HtmlNodeCollection imgs = doc.DocumentNode.SelectNodes("//img[@src]");
                if (imgs == null) return new List<string>();
                foreach (HtmlNode img in imgs)
                {
                    if (img.Attributes["src"] == null)
                        continue;
                    HtmlAttribute src = img.Attributes["src"];
                    imgList.Add(src.Value);
                    if (src.Value.StartsWith("http") || src.Value.StartsWith("https") || src.Value.StartsWith("www"))
                    {
                        images++;
                        string[] arr = src.Value.Split('/');
                        imgg = arr[arr.Length - 1];
                        wc.DownloadFile(src.Value, @"d:\MyImages\" + imgg);
                    }
                }
                return imgList;
            }
            catch
            {
                Logger.Write("There Was Problem Downloading The Image: " + imgg);
                return null;
            }
        }
    }
}
The above code is part of my web crawler. It downloads only image files from a website.
For example, I have this site: http://web.archive.org/web/20131216195236/http://open-hardware-monitor.googlecode.com/svn/trunk/
The site contains a file named App. If I right-click it and choose "Save as", I see that it is a config file. If I click the Hardware/ link, I see many *.cs files.
How can I change my code so that it downloads all kinds of file types rather than only images?
Recommended answer
Right now, the following line:
    HtmlNodeCollection imgs = doc.DocumentNode.SelectNodes("//img[@src]");
is grabbing all of the image tags and processing them. You will need to find a way to select all anchor tags whose href has an extension equal to .cs.
It will be similar to the line above. I recommend reading up on XPath, since that appears to be what SelectNodes uses to find elements.
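As a rough illustration of that suggestion, the sketch below selects anchors with "//a[@href]" and filters them by extension in C#. It filters in C# rather than in the query because, as far as I know, the XPath 1.0 engine behind SelectNodes has no ends-with function. The class LinkDownloader, the helper ResolveIfWanted, the method DownloadLinkedFiles, and the targetDir and extensions parameters are all hypothetical names introduced for this example. They are not part of the original code or of HtmlAgilityPack.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Net;
using HtmlAgilityPack;

// Hypothetical class for illustration only; not part of the original crawler.
static class LinkDownloader
{
    // Resolves href against the page address (so relative links work) and
    // returns the absolute http/https URI only if its path ends with one of
    // the wanted extensions. Returns null otherwise.
    public static Uri ResolveIfWanted(Uri pageUri, string href, IEnumerable<string> extensions)
    {
        Uri absolute;
        if (!Uri.TryCreate(pageUri, href, out absolute)) return null;
        if (absolute.Scheme != Uri.UriSchemeHttp && absolute.Scheme != Uri.UriSchemeHttps) return null;
        foreach (string ext in extensions)
        {
            if (absolute.AbsolutePath.EndsWith(ext, StringComparison.OrdinalIgnoreCase))
                return absolute;
        }
        return null;
    }

    // Downloads every linked file with a wanted extension into targetDir
    // and returns the local paths that were written.
    public static List<string> DownloadLinkedFiles(string address, string targetDir, IEnumerable<string> extensions)
    {
        var saved = new List<string>();
        var pageUri = new Uri(address);
        var doc = new HtmlAgilityPack.HtmlDocument();
        using (var wc = new WebClient())
        {
            using (Stream page = wc.OpenRead(pageUri))
            {
                doc.Load(page);
            }
            // Same call as the img query, but selecting anchors that carry an href.
            HtmlNodeCollection anchors = doc.DocumentNode.SelectNodes("//a[@href]");
            if (anchors == null) return saved;
            Directory.CreateDirectory(targetDir);
            foreach (HtmlNode a in anchors)
            {
                Uri fileUri = ResolveIfWanted(pageUri, a.Attributes["href"].Value, extensions);
                if (fileUri == null) continue;
                // The last URI segment is the file name, e.g. "App.config".
                string name = Uri.UnescapeDataString(fileUri.Segments[fileUri.Segments.Length - 1]);
                string path = Path.Combine(targetDir, name);
                wc.DownloadFile(fileUri, path);
                saved.Add(path);
            }
        }
        return saved;
    }
}
```

It could be called as LinkDownloader.DownloadLinkedFiles(address, @"d:\MyFiles", new[] { ".cs", ".config" }), where d:\MyFiles is an assumed target folder. The sketch does not follow directory links such as Hardware/, and it overwrites files that share a name. A real crawler would need to handle both.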
Hope this helps you get started!