Web crawling and robots.txt

Published: 2021/9/24 20:13:49. Tags: php, wget

I used wget to "download" a site.

i) It returned a .css file, a .js file, index.php, and an image, img1.jpg.

ii) However, there are more images under xyz.com. I typed www.xyz.com/Img2.jpg into the browser and an image was there.

iii) But index.php refers to a single image, i.e. img1.jpg.

iv) The site comes with a robots.txt file that contains Disallow:

What change should I make to the command line so that it returns everything under xyz.com, including files that are not referenced in index.php but sit as static files in the directory?
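
To make point iv concrete, here is a minimal sketch of how a crawler might interpret a Disallow: rule, using Python's standard urllib.robotparser. The post does not show what follows Disallow:, so the sketch covers both an empty rule and a rule of /. The helper name is_allowed, the "User-agent: *" line, and the "Wget" agent string are assumptions made for illustration, not taken from the actual site.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_lines, url, agent="Wget"):
    """Return True if the given robots.txt lines let `agent` fetch `url`.

    Hypothetical helper: the rules are passed in as a list of strings
    rather than downloaded, so nothing here touches the network.
    """
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(agent, url)


# Assumed rule sets; the real robots.txt content is not shown in the post.
empty_rule = ["User-agent: *", "Disallow:"]    # empty value: nothing is blocked
block_all = ["User-agent: *", "Disallow: /"]   # "/" blocks every path

url = "http://www.xyz.com/Img2.jpg"
print(is_allowed(empty_rule, url))
print(is_allowed(block_all, url))
```

Either way, robots.txt only tells a crawler which URLs it may request. It does not list files, so a link-following crawler still has no way to learn about Img2.jpg unless some page or directory listing points to it.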