Download all PDF files using wget

Views: 70 | Published: 2021/9/24 20:12:12 | Tag: wget

Problem description

I have the following site: http://www.asd.com.tr. I want to download all of its PDF files into one directory. I've tried a couple of commands, but without much luck.

$ wget --random-wait -r -l inf -nd -A pdf http://www.asd.com.tr/

With this command, only four PDF files were downloaded. Check this link; there are several thousand PDFs available:

http://www.asd.com.tr/Default.aspx

For instance, hundreds of files are in the following folder:

http://www.asd.com.tr/Folders/asd/...

But I can't figure out how to access them correctly so that I can see and download them all. There are several folders under http://www.asd.com.tr/Folders/, and thousands of PDFs inside those folders.

I've also tried to mirror the site using the -m option, but that failed too.

Any suggestions?

Recommended answer

First, verify that the website's terms of service (TOS) permit crawling it. Then, one solution is:

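# List the page's links, keep those ending in .pdf, encode whitespace as %20, then fetch each one.
mech-dump --links 'http://domain.com' |
    grep '\.pdf$' |
    sed -E 's/[[:space:]]+/%20/g' |
    xargs -I% wget http://domain.com/%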

The mech-dump command comes with Perl's WWW::Mechanize module (the libwww-mechanize-perl package on Debian and Debian-like distros).
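
The filtering and encoding steps of the pipeline can also be wrapped in a small shell function, which makes them easy to check without touching the network. This is only a sketch: the function name pdf_urls and the output directory pdfs are made up for illustration, and it assumes that relative links are relative to the site root (the same assumption the pipeline above makes by prepending http://domain.com/) and that the base URL contains no characters special to sed, such as | or &.

```shell
#!/bin/sh
# Hypothetical helper, not part of the original answer.
# Reads one link per line on stdin and prints absolute URLs for the PDF links.
# $1 is the base URL; relative links are assumed to be relative to it.
pdf_urls() {
    # Drop one trailing slash from the base URL, if present.
    base=${1%/}
    # Keep links ending in .pdf (case-insensitive), encode each space as %20,
    # then prefix the base URL to every link that is not already absolute.
    grep -i '\.pdf$' |
        sed -e 's/ /%20/g' \
            -e '/^https\{0,1\}:/!s|^/*|'"$base"'/|'
}

# Example usage (needs mech-dump and network access, so it is commented out):
# mech-dump --links 'http://domain.com' |
#     pdf_urls 'http://domain.com' |
#     wget -nd -P pdfs -i -
```

In the commented-out usage, wget reads the URL list from standard input (-i -), saves everything into a single directory (-P pdfs) and does not recreate the remote directory tree (-nd). Passing the whole list to one wget process avoids starting a new process per file, as xargs -I does.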
