How to parse a hidden JavaScript section from an HTML page

Question

I want to parse the dates and prices for September from the hidden calendar on this URL: http://www.lufthansa.com/vol/vol-paris-berlin . The problem is that when you click on September, the page generates the calendar without any change to the URL. I used the code below, but it returns no results.

public static void main(String[] args) 
        throws FailingHttpStatusCodeException, MalformedURLException, IOException {

    WebClient webClient = new WebClient();
    HtmlPage myPage = webClient.getPage("http://www.lufthansa.com/vol/vol-paris-berlin");
    Document doc = Jsoup.parse(myPage.asXml());

    for(Element s : doc.select("button.daygrid_cell.hasprice")) {
        String weekday_text = s.select(".weekday_text").text();
        String pricebox = s.select(".pricebox > .br").text();
        System.out.println(
                String.format(
                        "weekday_text=%s pricebox=%s", 
                        weekday_text, 
                        pricebox));
    }

    webClient.close();
}

Recommended answer

I currently don't see a way to do this with HtmlUnit myself.

You could "cut out the middleman" though, by sending the same query the Lufthansa page uses to populate the calendar view (the request URL is the one passed to Jsoup.connect in the example further down).

The response is in JSON format, so you can use a JSON parser to extract the information in the same form as it appears on the Lufthansa page (prices are always rounded up to the next integer). In the following example I used json-simple:

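import java.io.IOException;
import java.util.Iterator;
import java.util.Map;
import java.util.TreeMap;

import org.json.simple.JSONObject;
import org.json.simple.parser.JSONParser;
import org.json.simple.parser.ParseException;
import org.jsoup.Jsoup;

Map<String, Integer> prices = new TreeMap<String, Integer>(); // TreeMap keeps the date keys in sorted order

try {
    // Fetch the raw response body. ignoreContentType(true) lets Jsoup accept
    // a JSON response instead of rejecting it as an unsupported MIME type.
    String json = Jsoup
                .connect("https://bestprice-live-backend.mcon.net/flights-by-day?l=fr_fr&departure=PAR&destination=BER&departureFrom=2016-09-01&departureTo=2016-09-30&cabin=Economy&duration=7")
                .userAgent("Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.82 Safari/537.36")
                .referrer("http://www.lufthansa.com/vol/vol-paris-berlin")
                .ignoreContentType(true)
                .execute()
                .body();

    JSONObject obj = (JSONObject) new JSONParser().parse(json);

    // The "dates" object maps each date string to an object holding the price
    obj = (JSONObject) obj.get("dates");

    for (Iterator<?> iterator = obj.keySet().iterator(); iterator.hasNext();) {
        String key = (String) iterator.next();
        JSONObject dateObject = (JSONObject) obj.get(key);
        // json-simple yields Long for whole numbers and Double for decimals,
        // so go through Number rather than casting straight to Double
        double price = ((Number) dateObject.get("price")).doubleValue();
        int roundedPrice = (int) Math.ceil(price); // lufthansa displays prices rounded up
        prices.put(key, roundedPrice);
    }

    // Print the prices in date order
    for (String key : prices.keySet()) {
        System.out.println(key + ": " + prices.get(key) + " €");
    }
} catch (IOException e) {
    e.printStackTrace();
} catch (ParseException e) {
    e.printStackTrace();
}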

Output:

2016-09-01: 163 €
2016-09-02: 158 €
2016-09-03: 160 €
2016-09-04: 160 €
2016-09-05: 160 €
2016-09-06: 158 €
2016-09-07: 155 €
2016-09-08: 159 €
2016-09-09: 160 €
2016-09-10: 156 €
2016-09-11: 160 €
2016-09-12: 159 €
2016-09-13: 157 €
2016-09-14: 158 €
2016-09-15: 160 €
2016-09-16: 184 €
2016-09-17: 156 €
2016-09-18: 160 €
2016-09-19: 179 €
2016-09-20: 159 €
2016-09-21: 163 €
2016-09-22: 180 €
2016-09-23: 188 €
2016-09-24: 160 €
2016-09-25: 160 €
2016-09-26: 160 €
2016-09-27: 154 €
2016-09-28: 157 €
2016-09-29: 159 €
2016-09-30: 163 €
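
The request above hard-codes September 2016 in the departureFrom and departureTo parameters. If you want another month, the query string could be assembled along these lines. This is only a sketch: the endpoint and parameter names are copied from the request shown above, while the class name CalendarQuery and the method buildUrl are made up for illustration, and the backend may reject or no longer serve such requests.

```java
import java.time.YearMonth;

final class CalendarQuery {
    // Endpoint copied from the request in the answer; it may change or disappear.
    private static final String BASE =
            "https://bestprice-live-backend.mcon.net/flights-by-day";

    // Hypothetical helper: builds the query URL covering one whole month.
    static String buildUrl(String departure, String destination, YearMonth month) {
        return BASE
                + "?l=fr_fr"
                + "&departure=" + departure
                + "&destination=" + destination
                + "&departureFrom=" + month.atDay(1)        // first day, ISO yyyy-MM-dd
                + "&departureTo=" + month.atEndOfMonth()    // last day, ISO yyyy-MM-dd
                + "&cabin=Economy"
                + "&duration=7";
    }

    public static void main(String[] args) {
        System.out.println(buildUrl("PAR", "BER", YearMonth.of(2016, 9)));
    }
}
```

For YearMonth.of(2016, 9) this reproduces the URL used in the example; the resulting string can be passed to Jsoup.connect in its place.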
