Advice on crawling website content


Question

I am trying to crawl some website content using jsoup and Java. The goal is to save the relevant details to my database and repeat the same job daily.

But here is the deal: when I open the website in a browser, I get the rendered HTML (with all the element tags in place). The JavaScript snippet that I am supposed to use to extract the correct data works just fine when I test it there.

But when I do a parse/get with jsoup (from a Java class), only the initial page is downloaded for parsing. The website has some dynamic parts, and I want that data, but because it is rendered asynchronously after the initial GET, I am unable to capture it with jsoup.

Does anybody know a way around this? Am I using the right toolset? More experienced people, I would appreciate your advice.

Recommended answer

Before crawling, check whether the website requires any of the following in order to show all of its content:


  • Authentication with a login and password
  • Some form of session validation through HTTP headers
  • Cookies
  • A delay while all the content loads (sites that rely heavily on JavaScript libraries, CSS, and asynchronous data may need this)
  • A specific browser User-Agent
  • A proxy password if, for example, you are behind a corporate network's security configuration

If any item on this list is required, you can supply it through the parameters of your jsoup.connect() call. Please refer to the official documentation:

http://jsoup.org/cookbook/input/load-document-from-url
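
As a rough illustration of the request parameters listed above, here is a minimal sketch. To keep it free of external dependencies it uses the JDK's built-in java.net.http.HttpRequest builder rather than jsoup itself; the comments name the jsoup Connection method that plays the same role. The class name, URL, cookie name (SESSIONID), credentials, and User-Agent string are all placeholders, and HTTP Basic authentication is only an assumption, since many sites use a login form instead. The request is built but never sent.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;
import java.time.Duration;
import java.util.Base64;

public class CrawlRequestSketch {

    // Builds (but does not send) a GET request carrying the kinds of
    // parameters a site may require. Every value passed in is a placeholder.
    public static HttpRequest buildRequest(String url, String user, String password,
                                           String sessionCookie, String userAgent) {
        // Assumption: the site accepts HTTP Basic authentication.
        String credentials = user + ":" + password;
        String basicAuth = "Basic " + Base64.getEncoder()
                .encodeToString(credentials.getBytes(StandardCharsets.UTF_8));

        return HttpRequest.newBuilder(URI.create(url))
                // jsoup counterpart: .header("Authorization", basicAuth)
                .header("Authorization", basicAuth)
                // jsoup counterpart: .cookie("SESSIONID", sessionCookie)
                .header("Cookie", "SESSIONID=" + sessionCookie)
                // jsoup counterpart: .userAgent(userAgent)
                .header("User-Agent", userAgent)
                // jsoup counterpart: .timeout(10_000), in milliseconds
                .timeout(Duration.ofSeconds(10))
                .GET()
                .build();
    }

    public static void main(String[] args) {
        HttpRequest request = buildRequest(
                "https://example.com/listing", "alice", "secret",
                "abc123", "Mozilla/5.0 (compatible; ExampleCrawler/1.0)");
        // Print the headers that would be sent with the request.
        request.headers().map().forEach((name, values) ->
                System.out.println(name + ": " + values));
    }
}
```

Note that a timeout only bounds how long the client waits for the server's response. It does not make the client run the page's JavaScript, so content that the browser fills in after the initial load may still be missing from the downloaded HTML.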
