Crawl links of sitemap.xml through wget command
Problem Description
I am trying to crawl all the links in a sitemap.xml to re-cache a website, but wget's recursive option does not work. The only response I get is:
Remote file exists but does not contain any link -- not retrieving.
But the sitemap.xml is definitely full of "http://..." links.
I have tried almost every option of wget, but nothing worked for me:
wget -r --mirror http://mysite.com/sitemap.xml
Does anyone know how to open all the links inside a website's sitemap.xml?
Thanks, Dominic
Recommended Answer
It seems that wget can't parse XML, so you'll have to extract the links manually. You could do something like this:
wget --quiet http://www.mysite.com/sitemap.xml --output-document - | egrep -o "https?://[^<]+" | wget -i -
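The egrep one-liner above works by matching anything that looks like a URL in the raw XML, which can pick up stray matches (e.g. the xmlns attribute's schema URL). A more robust alternative, not from the original answer, is to parse the sitemap as XML and collect only the <loc> entries. Below is a minimal Python sketch under that assumption, shown against an inline sample sitemap; in practice you would fetch the real sitemap.xml first and pass its text in:

```python
import xml.etree.ElementTree as ET

# Standard sitemap namespace, per the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def extract_sitemap_urls(xml_text):
    """Return the <loc> URLs listed in a sitemap.xml document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc")]

# Inline sample standing in for a fetched sitemap.xml (hypothetical URLs).
sample = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://mysite.com/</loc></url>
  <url><loc>http://mysite.com/page1</loc></url>
</urlset>"""

for url in extract_sitemap_urls(sample):
    print(url)
```

You can then feed the printed URLs to wget -i - exactly as in the one-liner above, but without the risk of regex false positives.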
I got this from here.