How do I use the --accept-regex option for downloading a website with wget?

Problem Description

I'm trying to download an archive of my website — 3dsforums.com — using wget, but there are millions of pages I don't want to download, so I'm trying to tell wget to only download pages that match certain URL patterns, and yet I'm running into some roadblocks.

As an example, this is a URL I would like to download:

http://3dsforums.com/forumdisplay.php?f=46

...so I've tried using the --accept-regex option:

wget -mkEpnp --accept-regex "(forumdisplay\.php\?f=(\d+)$)" http://3dsforums.com

But it just downloads the home page of the website.

The only command that remotely works so far is the following:

wget -mkEpnp --accept-regex "(\w+\.php$)" http://3dsforums.com

This gives the following output:

Downloaded 9 files, 215K in 0.1s (1.72 MB/s)
Converting links in 3dsforums.com/faq.php.html... 16-19
Converting links in 3dsforums.com/index.html... 8-88
Converting links in 3dsforums.com/sendmessage.php.html... 14-15
Converting links in 3dsforums.com/register.php.html... 13-14
Converting links in 3dsforums.com/showgroups.php.html... 14-29
Converting links in 3dsforums.com/index.php.html... 16-80
Converting links in 3dsforums.com/calendar.php.html... 17-145
Converting links in 3dsforums.com/memberlist.php.html... 14-99
Converting links in 3dsforums.com/search.php.html... 15-16
Converted links in 9 files in 0.009 seconds.

Is there something wrong with my regular expressions? Or am I misunderstanding the use of the --accept-regex option? I've been trying all sorts of variations today but I'm not quite grasping what the actual problem is.

Recommended Answer

wget uses POSIX regular expressions by default, where the \d class is written as [[:digit:]] and the \w class as [[:word:]]. And why all the grouping? If your wget is compiled with PCRE support, make your life easier and do it like this:

wget -mkEpnp --regex-type pcre --accept-regex "forumdisplay.php\?f=\d+$" http://3dsforums.com

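Not sure whether your build includes PCRE? On typical GNU wget builds, the feature list printed by --version shows +pcre or +pcre2 when support was compiled in (and -pcre when it wasn't), so a quick check along these lines should tell you (a hedged sketch; the exact feature name varies by version):

# print the PCRE-related entry from wget's compile-time feature list
wget --version | grep -oE '[+-]pcre2?'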

But the command above still won't work, because your forum software automatically creates session IDs (s=<session_id>) and injects them into all the links, so you need to account for those as well:

wget -mkEpnp --regex-type pcre --accept-regex "forumdisplay\.php\?(s=.*)?f=\d+(s=.*)?$" http://3dsforums.com

The only problem is that now your files will be saved with the session ID in their names, so you'll have to add another step after wget finishes: bulk-rename all the files that carry the session ID in their names. You could probably do it by piping the file names through sed, but I'll leave that to you :)
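
For instance, a minimal rename sketch along these lines might do it, assuming the session parameter shows up in the saved filenames as "s=<something>&" (the path and the exact parameter format here are guesses, so adjust them to match what wget actually wrote):

# strip the "s=...&" session parameter from each saved filename;
# the [ "$f" = "$new" ] guard skips files that have no session ID,
# and mv -n avoids clobbering if two names collapse to the same result
for f in 3dsforums.com/forumdisplay.php*; do
  new=$(printf '%s' "$f" | sed 's/s=[^&]*&//')
  [ "$f" = "$new" ] || mv -n -- "$f" "$new"
done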

And if your wget doesn't support PCRE, this pattern will end up being quite long, but let's hope it does...
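
In case it doesn't, here is a rough POSIX equivalent of the same filter (an untested sketch: posix is wget's default --regex-type, and [[:digit:]] stands in for \d):

wget -mkEpnp --accept-regex "forumdisplay\.php\?(s=.*)?f=[[:digit:]]+(s=.*)?$" http://3dsforums.com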
