Fetch contents (loaded through an AJAX call) of a web page
Question
I am a beginner to crawling. I have a requirement to fetch the posts and comments from a link, and I want to automate this process. I considered using a web crawler and jsoup for this, but was told that web crawlers are mostly used for websites with greater depth.
Sample page: Jive community website
For this page, when I view the page source, I can see only the post and not the comments. I think this is because the comments are fetched through an AJAX call to the server.
Hence, when I use jsoup, it doesn't fetch the comments.
So how can I automate the process of fetching posts and comments?
Answer
Jsoup is an HTML parser only. Unfortunately, it can't parse any JavaScript/AJAX-loaded content, since jsoup can't execute scripts.
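To illustrate the limitation, here is a minimal jsoup sketch (the HTML snippet and the class names are made up for the example): jsoup sees only the markup it is handed, so a container that a script would fill in later stays empty.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class StaticParseDemo {
    public static void main(String[] args) {
        // Static HTML as the server returns it: the post is present,
        // but the comments container is empty until a script fills it.
        String html = "<div class='post'>Hello post</div><div id='comments'></div>";
        Document doc = Jsoup.parse(html);

        System.out.println(doc.select("div.post").text());            // Hello post
        System.out.println(doc.select("#comments div.comment").size()); // 0
    }
}
```

Whatever the AJAX call would have inserted into `#comments` is simply absent from the document jsoup builds, which is why the comments never show up.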
The solution: use a library that can execute scripts.
Here are some examples I know of:
- HtmlUnit
- Java Script Engine
- Apache Commons BSF
- Rhino
If such a library doesn't support parsing or selectors, you can at least use it to get the HTML out of the scripts (which can then be parsed by jsoup).
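That combination can be sketched roughly as follows with HtmlUnit (the URL, the wait time, and the `div.comment` selector are all placeholder assumptions, not details from the original page): HtmlUnit runs the page's JavaScript so the AJAX-loaded comments end up in the DOM, and the rendered markup is then handed to jsoup for convenient CSS-selector extraction.

```java
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class RenderThenParse {
    public static void main(String[] args) throws Exception {
        String rendered;
        try (WebClient client = new WebClient()) {
            // Real pages often have script errors we don't care about.
            client.getOptions().setThrowExceptionOnScriptError(false);
            HtmlPage page = client.getPage("https://community.example.com/thread/123");
            // Give pending AJAX calls a chance to finish (the timeout is a guess).
            client.waitForBackgroundJavaScript(5_000);
            // asXml() now contains the comments the raw source lacked.
            rendered = page.asXml();
        }
        // Hand the rendered markup to jsoup for extraction.
        Document doc = Jsoup.parse(rendered);
        for (Element comment : doc.select("div.comment")) {
            System.out.println(comment.text());
        }
    }
}
```

The inspect-element view in your browser will tell you which selector actually matches the comments on the real page; `div.comment` here is just a stand-in.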