How to get all webpages on a domain

Question

I am making a simple web spider and I was wondering if there is a way that can be triggered in my PHP code that I can get all the webpages on a domain...

e.g. Let's say I wanted to get all the webpages on Stackoverflow.com. That means that it would get:
https://stackoverflow.com/questions/ask
pulling webpages from an adult site -- how to get past the site agreement?
https://stackoverflow.com/questions/1234214/
Best Rails HTML Parser

And all the links. How can I get that? Or is there an API or DIRECTORY that can enable me to get that?

Also, is there a way I can get all the subdomains?

How do I crawl sites that have no Sitemaps or Syndication feeds?

Cheers.

Answer

If a site wants you to be able to do this, they will probably provide a Sitemap. Using a combination of a sitemap and following the links on pages, you should be able to traverse all the pages on a site - but this is really up to the owner of the site, and how accessible they make it.
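The sitemap-plus-link-following approach above can be sketched in a few small pieces. The question asks about PHP, but the idea is language-agnostic; this is a minimal illustration in Python using only the standard library. The function names and the `example.com` URLs are illustrative, not part of any real API: one helper pulls page URLs out of a sitemap XML document, and another extracts same-host links from a fetched page so the crawler stays on the original domain.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Namespace used by the standard sitemap protocol (sitemaps.org).
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def sitemap_urls(xml_text):
    """Extract page URLs from a sitemap XML body (<urlset><url><loc>...)."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc")]

class LinkExtractor(HTMLParser):
    """Collect href targets from <a> tags, resolved against a base URL."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links like "/questions/ask".
                    self.links.append(urljoin(self.base_url, value))

def same_domain_links(html_text, base_url):
    """Return links from the page that stay on the same host as base_url."""
    parser = LinkExtractor(base_url)
    parser.feed(html_text)
    host = urlparse(base_url).netloc
    return [u for u in parser.links if urlparse(u).netloc == host]
```

A crawler would seed a queue with the sitemap URLs, fetch each page, run `same_domain_links` on the body, and enqueue any URL it has not seen yet. Note that filtering on the exact host means subdomains are excluded; whether to treat them as "the same site" is a policy decision for the crawler.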

If the site does not want you to do this, there is nothing you can do to work around it. HTTP does not provide any standard mechanism for listing the contents of a directory.
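The closest thing a cooperative site offers is advertising its sitemap in `robots.txt` via the `Sitemap:` directive, which is part of the sitemaps.org protocol. A small, hedged sketch of discovering those declarations (the function name is made up for illustration):

```python
def sitemap_locations(robots_txt):
    """Extract Sitemap: declarations from a robots.txt body."""
    locations = []
    for line in robots_txt.splitlines():
        # The directive is case-insensitive; the value is a full URL.
        if line.strip().lower().startswith("sitemap:"):
            locations.append(line.split(":", 1)[1].strip())
    return locations
```

If this returns nothing and the site has no linked sitemap, the only remaining option is following links page by page, subject to whatever `robots.txt` disallows.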
