Web crawling and robots.txt
Question
I used wget to 'download' a site:
wget -r http://www.xyz.com
i) It returns a .css file, a .js file, index.php, and one image, img1.jpg.
ii) However, there are more images under xyz.com. I typed www.xyz.com/Img2.jpg and got an image, so it does exist on the server.
iii) But index.php references only a single image, img1.jpg.
iv) A robots.txt file accompanies the site that contains Disallow:
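As an aside on robots.txt semantics: a bare Disallow: line with no path actually allows everything, while blocking a whole site requires a path. The file's actual rules were not shown; a generic blanket-block example would look like:

```
User-agent: *
Disallow: /
```

Note that wget honors robots.txt by default during recursive retrieval, so such rules can also limit what `wget -r` fetches.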
What change should be made to the command line so that it returns everything under xyz.com that is not referenced in index.php but is static in the directory?
Answer
Not possible. How would wget know about other files in the directory unless there is a link to them somewhere?
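To illustrate why: a recursive crawler discovers URLs only by parsing href/src attributes in the pages it has already fetched, so a file no page links to is invisible to it. A minimal sketch of this link-following discovery, using Python's html.parser and a hypothetical index.php body (the real page's markup was not shown):

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect URLs from href/src attributes, the way a recursive crawler would."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)

# Hypothetical contents of index.php as served: it links only img1.jpg,
# so a crawler never learns that Img2.jpg also exists on the server.
page = '''
<link rel="stylesheet" href="style.css">
<script src="app.js"></script>
<img src="img1.jpg">
'''

parser = LinkExtractor()
parser.feed(page)
print(parser.links)  # only the referenced files; Img2.jpg is absent
```

Unless the server exposes a directory listing, there is simply no URL source from which wget could learn about the unreferenced static files.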