Make a web crawler/spider
Question
I'm looking into making a web crawler/spider, but I need someone to point me in the right direction to get started.
Basically, my spider is going to search for audio files and index them.
I'm just wondering if anyone has any ideas on how I should do it. I've heard that doing it in PHP would be extremely slow. I know VB.NET, so could that come in handy?
I was thinking about using Google's filetype search to get links to crawl. Would that be OK?
Recommended answer
In VB.NET you will need to get the HTML first, so use the WebClient class or the HttpWebRequest and HttpWebResponse classes. There is plenty of information online on how to use these.
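The fetch step can be sketched in a language-neutral way. The example below uses Python's standard library rather than the VB.NET classes named above, so treat it as an illustration of the step, not as the WebClient API. The function name fetch_html, the User-Agent string, and the default timeout are all assumptions made for this sketch.

```python
import urllib.request


def fetch_html(url, timeout=10.0):
    """Download one page and return its body as text.

    Hypothetical helper for illustration; the User-Agent value is made up.
    """
    request = urllib.request.Request(
        url, headers={"User-Agent": "audio-spider-sketch/0.1"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        raw = response.read()
        # Use the charset the server declares, falling back to UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
    # Replace undecodable bytes instead of failing on a badly encoded page.
    return raw.decode(charset, errors="replace")
```

A call such as fetch_html("http://example.com/") would return the page source as a string, which is what the parsing step needs.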
Then you will need to parse the HTML. I recommend using regular expressions for this.
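As a rough sketch of the regular-expression approach, the Python example below pulls audio links out of href and src attributes and resolves them against the page URL. The list of extensions, the pattern, and the name find_audio_links are assumptions chosen for illustration. A pattern like this only handles quoted attribute values and will miss unusual markup, so a real HTML parser may prove more robust on messy pages.

```python
import re
from urllib.parse import urljoin

# Matches href="..." or src="..." values that end in a common audio extension,
# optionally followed by a query string. The extension list is an assumption.
AUDIO_LINK = re.compile(
    r'''(?:href|src)\s*=\s*["']([^"'#]+\.(?:mp3|wav|ogg|flac|m4a)(?:\?[^"'#]*)?)["']''',
    re.IGNORECASE,
)


def find_audio_links(html, base_url):
    """Return absolute audio URLs found in html, in order, without duplicates."""
    seen = set()
    links = []
    for match in AUDIO_LINK.finditer(html):
        # Resolve relative links such as "/music/a.mp3" against the page URL.
        absolute = urljoin(base_url, match.group(1))
        if absolute not in seen:
            seen.add(absolute)
            links.append(absolute)
    return links
```

The returned list is what the spider would add to its index, and the non-audio links on the same page are what it would queue for further crawling.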
Your idea of using Google for a filetype search is a good one. I did a similar thing a few years ago to gather PDFs to test PDF indexing in SharePoint, which worked really well.
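For reference, a filetype search is an ordinary query that includes the filetype: operator. The Python sketch below only builds such a search URL; the function name filetype_search_url and the keyword argument are hypothetical, and how the results are then retrieved is left open.

```python
from urllib.parse import urlencode


def filetype_search_url(extension, keywords=""):
    """Build a Google search URL restricted to one file type (illustrative helper)."""
    query = f"filetype:{extension} {keywords}".strip()
    # urlencode escapes the colon and turns spaces into '+'.
    return "https://www.google.com/search?" + urlencode({"q": query})
```

For example, filetype_search_url("mp3", "jazz") yields a URL whose query is filetype:mp3 jazz.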