How to best develop web crawlers


Question

I am used to creating crawlers to compile information, and when I come across a website with the info I need, I start a new crawler specific to that site, using shell scripts most of the time and sometimes PHP.

The way I do it is with a simple for loop to iterate over the page list, wget to download each page, and sed, tr, awk, or other utilities to clean the page and grab the specific info I need.
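For concreteness, here is a minimal Python sketch of that same loop-download-clean pattern (the URL list and the `<h2>` pattern are hypothetical placeholders; the regex stands in for the sed/tr/awk step):

```python
# Minimal sketch of the loop-download-clean pattern described above.
# The URL list and the <h2> pattern are hypothetical placeholders.
import re
import urllib.request

pages = ["https://example.com/list?page=%d" % n for n in range(1, 4)]

for url in pages:
    html = urllib.request.urlopen(url).read().decode("utf-8", "replace")
    # Crude text munging, equivalent to piping through sed/awk:
    for title in re.findall(r"<h2[^>]*>(.*?)</h2>", html, re.S):
        print(title.strip())
```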

The whole process takes some time depending on the site, and more to download all the pages. And I often run into an AJAX site that complicates everything.

I was wondering whether there are better ways to do this, faster ways, or even applications or languages that would help with this kind of work.

Answer

Using regular expressions for parsing content is a bad idea that has been covered in questions here countless times.

You should be parsing the document into a DOM tree; then you can pull out any hyperlinks, stylesheets, script files, images, or other external links that you want and traverse them accordingly.
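As a sketch of that idea using only Python's standard library (html.parser is event-driven rather than a full DOM builder, but it extracts the same kinds of links structurally, with no regex; the URL is a placeholder):

```python
# Parse the page structurally and pull out hyperlinks, stylesheets,
# scripts and images, instead of regex-matching the raw HTML.
from html.parser import HTMLParser
import urllib.request

class LinkExtractor(HTMLParser):
    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "href" in attrs:
            print("hyperlink: ", attrs["href"])
        elif tag == "link" and attrs.get("rel") == "stylesheet":
            print("stylesheet:", attrs.get("href"))
        elif tag == "script" and "src" in attrs:
            print("script:    ", attrs["src"])
        elif tag == "img" and "src" in attrs:
            print("image:     ", attrs["src"])

html = urllib.request.urlopen("https://example.com/").read().decode("utf-8", "replace")
parser = LinkExtractor()
parser.feed(html)
```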

Many scripting languages have packages for fetching web pages (e.g. curl for PHP) and for parsing HTML (e.g. Beautiful Soup for Python). Go that route instead of the hacky solution of regular expression matching.
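A short sketch of that route on the Python side, assuming Beautiful Soup is installed (`pip install beautifulsoup4`); the URL is again a placeholder:

```python
# Fetch a page and walk its parsed tree with Beautiful Soup,
# rather than regex-matching the raw HTML.
import urllib.request
from bs4 import BeautifulSoup

html = urllib.request.urlopen("https://example.com/").read()
soup = BeautifulSoup(html, "html.parser")

# Pull every hyperlink that actually carries an href attribute:
for a in soup.find_all("a", href=True):
    print(a["href"])
```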
