Java - Get every webpage associated with domain name programmatically


Problem description

I want to make a program where a user enters a URL, and the program responds with every web page associated under that domain name. Right now, I'm using Jsoup to get every <a href> link, but that does not always cover every web page on a site if the site changes pages through AngularJS or something else. Any advice on how best to do this?

Recommended answer

You don't need jsoup for this. Just navigate to the host's robots.txt

https://stackoverflow.com/robots.txt

and find the sitemap.xml.

Sitemap: /sitemap.xml
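Extracting those `Sitemap:` directives can be done with a few lines of standard Java. A minimal sketch (the class and method names `RobotsSitemaps` and `extractSitemaps` are hypothetical; fetching the robots.txt body itself can be done with `java.net.http.HttpClient`):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: pull "Sitemap:" entries out of a robots.txt body.
public class RobotsSitemaps {
    public static List<String> extractSitemaps(String robotsTxt) {
        List<String> sitemaps = new ArrayList<>();
        for (String line : robotsTxt.split("\\R")) {
            String trimmed = line.trim();
            // The Sitemap directive is conventionally case-insensitive.
            if (trimmed.regionMatches(true, 0, "Sitemap:", 0, 8)) {
                sitemaps.add(trimmed.substring(8).trim());
            }
        }
        return sitemaps;
    }

    public static void main(String[] args) {
        String sample = "User-agent: *\nDisallow: /admin\n"
                + "Sitemap: https://example.com/sitemap.xml\n";
        System.out.println(extractSitemaps(sample));
        // prints [https://example.com/sitemap.xml]
    }
}
```

Note that a `Sitemap:` value may be a relative path (as above) or an absolute URL, so you may need to resolve it against the host before fetching.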

In the case of SO, they are cached on Google.

This will have all of the links the website wants to be publicly available. Or in the case of SO, a list of additional site maps to scan through.

https://stackoverflow.com/sitemap-questions-0.xml      
https://stackoverflow.com/sitemap-questions-1.xml 
https://stackoverflow.com/sitemap-questions-2.xml 
https://stackoverflow.com/sitemap-questions-3.xml 
https://stackoverflow.com/sitemap-questions-4.xml 
https://stackoverflow.com/sitemap-questions-5.xml 
https://stackoverflow.com/sitemap-questions-6.xml
....
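Each of those sitemap files is plain XML whose page URLs sit in `<loc>` elements, so they can be collected with the JDK's built-in DOM parser. A sketch under that assumption (the class name `SitemapLinks` is hypothetical):

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical helper: collect every <loc> entry from a sitemap
// (or sitemap index) document.
public class SitemapLinks {
    public static List<String> extractLocs(String sitemapXml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(
                        sitemapXml.getBytes(StandardCharsets.UTF_8)));
        // Works for both <urlset><url><loc>… and <sitemapindex><sitemap><loc>…
        NodeList locs = doc.getElementsByTagName("loc");
        List<String> urls = new ArrayList<>();
        for (int i = 0; i < locs.getLength(); i++) {
            urls.add(locs.item(i).getTextContent().trim());
        }
        return urls;
    }

    public static void main(String[] args) throws Exception {
        String sample = "<urlset><url><loc>https://example.com/page1</loc></url>"
                + "<url><loc>https://example.com/page2</loc></url></urlset>";
        System.out.println(extractLocs(sample));
        // prints [https://example.com/page1, https://example.com/page2]
    }
}
```

If a `<loc>` comes from a sitemap index (as with SO's list above), fetch that URL and run the same extraction on it recursively.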
