Make a web crawler/spider


Question

I'm looking into making a web crawler/spider, but I need someone to point me in the right direction to get started.

Basically, my spider is going to search for audio files and index them.

I'm just wondering if anyone has any ideas for how I should do it. I've heard that doing it in PHP would be extremely slow. I know VB.NET, so could that come in handy?

I was thinking about using Google's filetype search to get links to crawl. Would that be OK?

Recommended answer

In VB.NET you will need to get the HTML first, so use the WebClient class or the HttpWebRequest and HttpWebResponse classes. There is plenty of information online on how to use these.
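For illustration only, here is a minimal sketch of the "get the HTML first" step. It is written in Python with the standard library rather than VB.NET, so `urllib.request` stands in for WebClient/HttpWebRequest; the function name `fetch_html`, the User-Agent string, and the timeout value are assumptions made up for this example.

```python
import urllib.request


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Download one page and return its body as text."""
    # The User-Agent value is a made-up placeholder; some servers reject
    # requests that do not send one at all.
    request = urllib.request.Request(
        url, headers={"User-Agent": "audio-spider-sketch/0.1"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # Use the charset declared by the server, falling back to UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling `fetch_html("http://example.com/")` would return that page's markup as a string. A real crawler would also want error handling, rate limiting, and a check of each site's robots.txt, none of which is shown here.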

Then you will need to parse the HTML. I recommend using regular expressions for this.
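As a rough sketch of the regular-expression approach, the Python snippet below pulls `href` values out of a page and separates links that look like audio files from ordinary pages the spider could visit next. The extension list and all names (`AUDIO_EXTENSIONS`, `extract_links`, `split_links`) are assumptions for this example, not something from the answer.

```python
import re
from urllib.parse import urljoin

# Assumed set of audio extensions; adjust to whatever should be indexed.
AUDIO_EXTENSIONS = ("mp3", "wav", "ogg", "flac", "m4a")

# Matches href="..." or href='...' and captures the quoted value.
HREF_PATTERN = re.compile(r'href\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)

# Matches URLs ending in an audio extension, optionally followed by a
# query string or fragment.
AUDIO_PATTERN = re.compile(
    r"\.(?:%s)(?:[?#].*)?$" % "|".join(AUDIO_EXTENSIONS), re.IGNORECASE
)


def extract_links(html, base_url):
    """Return every href in the page as an absolute URL."""
    return [urljoin(base_url, href) for href in HREF_PATTERN.findall(html)]


def split_links(html, base_url):
    """Split the page's links into (audio file URLs, other URLs)."""
    audio, pages = [], []
    for link in extract_links(html, base_url):
        (audio if AUDIO_PATTERN.search(link) else pages).append(link)
    return audio, pages
```

The audio list is what would go into the index, and the other list is the queue of pages to crawl next. A pattern like this only handles quoted `href` attributes and will miss or misread unusual markup, so a dedicated HTML parser may be more robust on messy real-world pages.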

Your idea of using Google for a filetype search is a good one. I did a similar thing a few years ago to gather PDFs to test PDF indexing in SharePoint, which worked really well.
