弹出窗口使用wget阻止从网站批量下载pdf [英] Popups block bulk download of pdfs from website with wget

查看:104
本文介绍了弹出窗口使用wget阻止从网站批量下载pdf的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我想从网站,其中wget使用以下bash脚本:

I would like to download some free-to-download pdfs (copies of old newspaper) from this website of the Austrian National Library with wget using the bash script below:

for year in {14..57}; do
  for month in `seq -w 1 12`; do # -w for leading zero
    for day in `seq -w 1 31`; do
      wget -A pdf -nc -E -nd --no-check-certificate --content-disposition http://anno.onb.ac.at/pdfs/ONB_lzg_18$year$month$day.pdf
    done
  done
done

除了某些报纸问题外,即使存在,我也无法下载任何问题.我会遇到诸如1814年6月30日的现有问题这样的错误:

Aside of some newspaper issues not being available, I cannot download any issues even though they exist. I would get errors such as the one for the existing issue of June 30, 1814 for example:

http://anno.onb.ac.at/pdfs/ONB_lzg_18140630.pdf
Aufl"osen des Hostnamens anno.onb.ac.at (anno.onb.ac.at)... 193.170.112.230
Verbindungsaufbau zu anno.onb.ac.at (anno.onb.ac.at)|193.170.112.230|:80 ... verbunden.
HTTP-Anforderung gesendet, auf Antwort wird gewartet ... 404 Not Found
FEHLER 404: Not Found.

但是,如果您要手动下载相应的pdf文件(wget毫无问题地下载问题.

However, if you were to download the corresponding pdfs manually (here, see upper-right corner) you have to press "ok" in a pop-up acknowledgement. Once you did this, I can even download the issue via wget without a problem.

我如何告诉wget通过命令行确认确认(要下载pdf时出现的问题),请参见下面的屏幕截图?在wget中有命令吗?

How can I tell wget to confirm via the command line the acknowledgements (the question you get once you want to download a pdf), see screenshot below? Is there a command in wget for that?

推荐答案

您的代码中有两个问题.

There are two issues in your code.

  1. lgz报纸并非在所有日期都可用
  2. 并非总是生成PDF并将其缓存在您使用的URL上.您需要先运行其他URL,以确保生成了PDF
  1. lgz newspaper is not available for all the dates
  2. The PDF are not always generated and cached on the URL you used. You need to first run the other URL to make sure the PDF is generated

下面是应该工作的更新代码

Below is the updated code that should work

#!/bin/bash

for year in {14..57}; do
  DATES=$(curl -sS "http://anno.onb.ac.at/cgi-content/anno?aid=lzg&datum=18$year&zoom=33" |   gawk 'match($0, /datum=([^&]+)/, ary) {print ary[1]}' | xargs echo)

  for date in $DATES
  do 
      echo "Downloading for $date"

      curl "http://anno.onb.ac.at/cgi-content/anno_pdf.pl?aid=lzg&datum=$date" -H 'Connection: keep-alive' -H 'Upgrade-Insecure-Requests: 1' -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36' -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8' -H 'DNT: 1' -H "Referer: http://anno.onb.ac.at/cgi-content/anno?aid=lzg&datum=$date" -H 'Accept-Encoding: gzip, deflate' -H 'Accept-Language: en-US,en;q=0.9' --compressed

      wget -A pdf -nc -E -nd --no-check-certificate --content-disposition http://anno.onb.ac.at/pdfs/ONB_lzg_$date.pdf
  done
done

这篇关于弹出窗口使用wget阻止从网站批量下载pdf的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆