Jsoup: get dynamically generated HTML


Question

I can connect to most sites and get the HTML just fine, but when I try to connect to a website where most of the content is generated by JavaScript after the initial page load, I do not get any of that data. Is there a way to do this with Jsoup, or does it not support this?

Recommended answer

Jsoup includes some basic connection handling, but it is not a web browser. It excels at parsing static HTML content, and it does not run any JavaScript, so content that scripts generate after the page loads never appears in the HTML it fetches. There are, however, two options you can pursue:


  1. You can analyze the page that you want to retrieve and find out how the content you are interested in gets loaded. Often it is not very hard to tap the original source of the loaded content and work with that directly. This approach has the benefit that you get what you want without extra libraries, and the retrieval will be fast.

  2. You can use a (full) browser and automate the loading of the page. A very good tool for this is Selenium WebDriver in combination with the headless WebKit browser PhantomJS. This, however, requires extra software and extra libraries in your project, and it will run much slower than the first option.
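As a rough illustration of the first option: once the browser's network tab shows which request delivers the data, that request can be issued directly. The sketch below assumes the page loads its content from a JSON endpoint. The endpoint URL passed on the command line and the names `DynamicContentFetcher`, `buildRequest`, and `fetch` are hypothetical. It uses the JDK's own `java.net.http` client (Java 11 or later) so that it runs without dependencies; Jsoup can fetch the same kind of response with `Jsoup.connect(url).ignoreContentType(true).execute().body()`.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

class DynamicContentFetcher {

    // Builds a GET request for the endpoint that the page's JavaScript calls.
    // The endpoint is something you discover yourself (for example in the
    // browser's network tab); nothing here is specific to any real site.
    static HttpRequest buildRequest(String endpointUrl) {
        return HttpRequest.newBuilder(URI.create(endpointUrl))
                .timeout(Duration.ofSeconds(10))
                // Assumption: the endpoint returns JSON.
                .header("Accept", "application/json")
                .GET()
                .build();
    }

    // Sends the request and returns the raw response body as a string.
    static String fetch(String endpointUrl) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpResponse<String> response =
                client.send(buildRequest(endpointUrl), HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IOException("Unexpected HTTP status " + response.statusCode());
        }
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            // No network access happens unless a URL is supplied.
            System.out.println("usage: java DynamicContentFetcher <endpoint-url>");
            return;
        }
        System.out.println(fetch(args[0]));
    }
}
```

The returned body is usually JSON rather than HTML, so it would typically go to a JSON parser. If the endpoint returns an HTML fragment instead, it can be handed to `Jsoup.parse(...)` as usual.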
