Popups block bulk download of pdfs from website with wget
Problem description
I would like to download some free-to-download PDFs (copies of old newspapers) from this website of the Austrian National Library with wget, using the bash script below:
for year in {14..57}; do
    for month in `seq -w 1 12`; do # -w for leading zero
        for day in `seq -w 1 31`; do
            wget -A pdf -nc -E -nd --no-check-certificate --content-disposition http://anno.onb.ac.at/pdfs/ONB_lzg_18$year$month$day.pdf
        done
    done
done
Aside from some newspaper issues not being available at all, I cannot download any issues even though they exist. For example, this is the error I get for the existing issue of June 30, 1814:
http://anno.onb.ac.at/pdfs/ONB_lzg_18140630.pdf
Aufl"osen des Hostnamens anno.onb.ac.at (anno.onb.ac.at)... 193.170.112.230
Verbindungsaufbau zu anno.onb.ac.at (anno.onb.ac.at)|193.170.112.230|:80 ... verbunden.
HTTP-Anforderung gesendet, auf Antwort wird gewartet ... 404 Not Found
FEHLER 404: Not Found.
However, if you download the corresponding PDF manually (see the upper-right corner of the issue's page), you have to press "ok" in a pop-up acknowledgement. Once that is done, I can even download the issue via wget without a problem.
How can I tell wget to confirm the acknowledgement (the prompt that appears when you request a PDF) from the command line? Is there a wget option for that?
Recommended answer
There are two issues in your code.
- The lzg newspaper is not available for all dates.
- The PDFs are not always generated and cached at the URL you used. You need to request the other URL first to make sure the PDF is generated.
Below is the updated code, which should work:
#!/bin/bash
for year in {14..57}; do
    # Scrape the year overview page for the dates that have an issue (the 3-argument match() needs gawk)
    DATES=$(curl -sS "http://anno.onb.ac.at/cgi-content/anno?aid=lzg&datum=18$year&zoom=33" | gawk 'match($0, /datum=([^&]+)/, ary) {print ary[1]}' | xargs echo)
    for date in $DATES
    do
        echo "Downloading for $date"
        # Request the PDF through the CGI script first so that the server generates and caches it
        curl "http://anno.onb.ac.at/cgi-content/anno_pdf.pl?aid=lzg&datum=$date" -H 'Connection: keep-alive' -H 'Upgrade-Insecure-Requests: 1' -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36' -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8' -H 'DNT: 1' -H "Referer: http://anno.onb.ac.at/cgi-content/anno?aid=lzg&datum=$date" -H 'Accept-Encoding: gzip, deflate' -H 'Accept-Language: en-US,en;q=0.9' --compressed
        # Then fetch the cached PDF from the static URL
        wget -A pdf -nc -E -nd --no-check-certificate --content-disposition "http://anno.onb.ac.at/pdfs/ONB_lzg_$date.pdf"
    done
done
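The date-extraction step in the script above can be tried offline. The following is a minimal sketch, not part of the original answer: extract_dates is a hypothetical helper, the sample markup is made up, and the only thing assumed about the real overview page is that issue links carry a datum=YYYYMMDD parameter, as the gawk pattern implies. Unlike that pattern, this version accepts only full eight-digit dates and removes duplicates.

```shell
#!/bin/bash
# Hypothetical helper: read overview-page HTML on stdin and print the
# unique issue dates (YYYYMMDD), one per line.
# Assumption: each issue link contains "datum=YYYYMMDD".
extract_dates() {
    grep -oE 'datum=[0-9]{8}' | cut -d= -f2 | sort -u
}

# Made-up sample markup for illustration; the real page may look different.
sample_html='<a href="/cgi-content/anno?aid=lzg&datum=18140630">30</a>
<a href="/cgi-content/anno?aid=lzg&datum=18140702">2</a>
<a href="/cgi-content/anno?aid=lzg&datum=18140630">30</a>
<a href="/cgi-content/anno?aid=lzg&datum=1814&zoom=33">1814</a>'

# The duplicate link and the year-only link (datum=1814) are both dropped.
printf '%s\n' "$sample_html" | extract_dates
```

If the assumption about the link format holds, the output of curl -sS for a year overview page could be piped into extract_dates in place of the gawk and xargs stages.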