Fetch contents (loaded through AJAX call) of a web page


Problem description

I am a beginner at crawling. I need to fetch the posts and comments from a link, and I want to automate this process. I considered using a web crawler and jsoup for this, but was told that web crawlers are mostly used for websites with greater depth.

Sample page: Jive community website

For this page, when I view the page source, I can see only the post and not the comments. I think this is because the comments are fetched through an AJAX call to the server.

Hence, when I use jsoup, it doesn't fetch the comments.
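A minimal jsoup-only sketch illustrates the problem (the thread URL and the div.comment / div.post-body selectors are hypothetical placeholders, not taken from the actual page): jsoup only downloads the initial HTML response and never executes JavaScript, so the AJAX-loaded comments are simply not in the parsed Document.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class StaticFetch {
    public static void main(String[] args) throws Exception {
        // Hypothetical thread URL -- replace with the real post link.
        String url = "https://community.example.com/thread/12345";

        // jsoup fetches only the raw HTML the server returns; it does not run
        // the page's JavaScript, so AJAX-injected content is missing here.
        Document doc = Jsoup.connect(url).userAgent("Mozilla/5.0").get();

        System.out.println("Post text:      " + doc.select("div.post-body").text());
        System.out.println("Comments found: " + doc.select("div.comment").size()); // typically 0
    }
}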

So how can I automate the process of fetching the posts and comments?

Recommended answer

Jsoup is an HTML parser only. Unfortunately, it can't parse any JavaScript/AJAX-generated content, since jsoup can't execute scripts.

The solution: use a library that can handle scripts.

Here are some examples I know of:

  • HtmlUnit
  • Java Script Engine
  • Apache Commons BSF
  • Rhino

If such a library doesn't support parsing or selectors, you can at least use it to get the HTML out of the scripts, which can then be parsed by jsoup.
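For instance, here is a minimal sketch following that approach with HtmlUnit: it loads the page, lets the JavaScript and background AJAX calls run, and then hands the rendered markup to jsoup for selecting. The thread URL and the div.comment / div.post-body selectors are hypothetical and must be adapted to the real page; the com.gargoylesoftware.htmlunit package prefix is the HtmlUnit 2.x one (3.x releases moved to org.htmlunit).

import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class AjaxFetch {
    public static void main(String[] args) throws Exception {
        // Hypothetical thread URL -- replace with the real post link.
        String url = "https://community.example.com/thread/12345";

        try (WebClient webClient = new WebClient(BrowserVersion.CHROME)) {
            webClient.getOptions().setJavaScriptEnabled(true);            // let the AJAX calls run
            webClient.getOptions().setCssEnabled(false);                  // CSS is irrelevant for scraping
            webClient.getOptions().setThrowExceptionOnScriptError(false); // tolerate noisy page scripts

            HtmlPage page = webClient.getPage(url);
            webClient.waitForBackgroundJavaScript(10_000);                // wait up to 10 s for AJAX to finish

            // Hand the rendered markup to jsoup for its convenient selector syntax.
            Document doc = Jsoup.parse(page.asXml());
            System.out.println("Post: " + doc.select("div.post-body").text());
            for (Element comment : doc.select("div.comment")) {
                System.out.println("Comment: " + comment.text());
            }
        }
    }
}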
