Read full content of a web page in Java

Problem Description

I want to crawl the whole content of the following link with a Java program. The first page is no problem, but when I try to crawl the data of the next pages, I get the same source code as for page one. Therefore a simple HTTP GET does not help at all.

This is the link for the page I need to crawl.
The web site has active content that needs to be interpreted and executed by an HTML/CSS/JavaScript rendering engine. I therefore have a simple solution with PhantomJS, but it is cumbersome to run PhantomJS code from Java.

Is there any easier way to read the whole content of the page with Java code? I already searched for a solution, but could not find anything suitable.

Appreciate your help,
kind regards.

Recommended Answer

Using the Chrome network log (or a similar tool in any other browser) you can identify the XHR request that loads the actual data displayed on the page. I have removed some of the query parameters, but essentially the request looks like this:

GET https://www.blablacar.de/search_xhr?fn=frankfurt&fcc=DE&tn=muenchen&tcc=DE&sort=trip_date&order=asc&limit=10&page=1&user_bridge=0&_=1461181945520

Helpfully, the query parameters look quite easy to understand. The order=asc&limit=10&page=1 part looks like it would be easy to adjust to return your desired results. You could adjust the page parameter to crawl successive pages of data.
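As a minimal sketch of that idea (assuming Java 11+ and that the endpoint answers a plain GET without session cookies; the X-Requested-With header, the page range, and the class name are my own additions, not part of the original answer), crawling successive pages could look like this:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class SearchXhrCrawler {

    // Endpoint copied from the XHR request above; some query parameters were
    // already stripped in the original answer, and the trailing cache-buster
    // ("_=...") is omitted here.
    private static final String BASE_URL =
            "https://www.blablacar.de/search_xhr?fn=frankfurt&fcc=DE"
            + "&tn=muenchen&tcc=DE&sort=trip_date&order=asc&limit=10&page=";

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();

        // Crawl a few successive pages by varying the "page" parameter.
        for (int page = 1; page <= 3; page++) {
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create(BASE_URL + page))
                    // Marks the call as AJAX-like; whether the server actually
                    // requires this header is an assumption.
                    .header("X-Requested-With", "XMLHttpRequest")
                    .GET()
                    .build();

            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());

            System.out.println("Page " + page + " -> HTTP " + response.statusCode());
            System.out.println(response.body());
        }
    }
}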

The response is JSON, for which there are a ton of libraries available.
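For example, with Gson on the classpath the body could be parsed along these lines. Note that the field names used here ("pager", "page", "pages", "trips") are purely illustrative placeholders, since the original answer does not show the response structure; read the real names off the actual XHR response.

import com.google.gson.JsonObject;
import com.google.gson.JsonParser;

public class SearchXhrJsonExample {
    public static void main(String[] args) {
        // Stand-in for the body returned by the XHR endpoint; replace with the
        // string obtained from the HTTP response.
        String jsonBody = "{\"pager\":{\"page\":1,\"pages\":42},\"trips\":[]}";

        JsonObject root = JsonParser.parseString(jsonBody).getAsJsonObject();
        JsonObject pager = root.getAsJsonObject("pager");

        System.out.println("Page " + pager.get("page").getAsInt()
                + " of " + pager.get("pages").getAsInt());
        System.out.println("Trips on this page: "
                + root.getAsJsonArray("trips").size());
    }
}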
