Crawl links of sitemap.xml through wget command
Problem Description
I am trying to crawl all the links in a sitemap.xml to re-cache a website, but wget's recursive option does not work. The only response I get is:
Remote file exists but does not contain any link -- not retrieving.
But the sitemap.xml is definitely full of "http://..." links.
I have tried almost every option of wget, but nothing worked for me:
wget -r --mirror http://mysite.com/sitemap.xml
Does anyone know how to open all the links inside a website's sitemap.xml?
Thanks, Dominic
Recommended Answer
It seems that wget can't parse XML, so you'll have to extract the links manually. You could do something like this:
wget --quiet http://www.mysite.com/sitemap.xml --output-document - | egrep -o "https?://[^<]+" | wget -i -
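The egrep one-liner above works by matching anything that looks like a URL in the raw XML, which can pick up stray matches (e.g. the xmlns attribute's schema URL). A more robust alternative, not from the original answer, is to parse the sitemap as XML and collect only the <loc> entries. Below is a minimal Python sketch under that assumption, shown against an inline sample sitemap; in practice you would fetch the real sitemap.xml first and pass its text in:

```python
import xml.etree.ElementTree as ET

# Standard sitemap namespace, per the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def extract_sitemap_urls(xml_text):
    """Return the <loc> URLs listed in a sitemap.xml document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc")]

# Inline sample standing in for a fetched sitemap.xml (hypothetical URLs).
sample = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://mysite.com/</loc></url>
  <url><loc>http://mysite.com/page1</loc></url>
</urlset>"""

for url in extract_sitemap_urls(sample):
    print(url)
```

You can then feed the printed URLs to wget -i - exactly as in the one-liner above, but without the risk of regex false positives.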
I got this from here.