Unable to retrieve table elements using jsoup


Question

I'm new to jsoup and I'm struggling to retrieve the tables with class name verbtense and the headers Present and Past, under the div named Indicative, from this site: https://www.verbix.com/webverbix/Swedish/misslyckas

I started by trying the following, but I get no results from the get-go:

Document document = Jsoup.connect("https://www.verbix.com/webverbix/Swedish/misslyckas").get();
Elements tables = document.select("table[class=verbtense]"); // empty

I also tried this, but again got no results:

        Document document = Jsoup.connect("https://www.verbix.com/webverbix/Swedish/misslyckas").get();

        Elements divs = document.select("div");


        if (!divs.isEmpty()) {
            for (Element div : divs) {
                // all of these are empty
                Elements verbTenses = div.getElementsByClass("verbtense");
                Elements verbTables = div.getElementsByClass("verbtable");
                Elements tables = div.getElementsByClass("table verbtable");
            }
        }

What am I doing wrong?

Recommended answer

The first catch is that this page loads its content asynchronously using AJAX and uses JavaScript to add the content to the DOM. You can even see the loading indicator for a short time.

Jsoup can't execute JavaScript, so all you get is the initial page. The next step is to check what the browser is doing and where this additional content comes from. You can check it with Chrome's developer tools (Ctrl + Shift + I). If you open the Network tab, filter to XHR requests only, and refresh the page, you can see two requests:

One of them fetches this content: https://api.verbix.com/conjugator/iv1/ab8e7bb5-9ac6-11e7-ab6a-00089be4dcbc/1/21/121/misslyckas. As you can see, it's JSON containing HTML fragments, and this content seems to have the verb forms you need. But here's another catch: Jsoup can't parse JSON, so you'll have to use another library to get the HTML fragment, which you can then parse with Jsoup. The general advice for downloading JSON with Jsoup is to ignore the content type (otherwise Jsoup will complain that it doesn't support JSON):

String json = Jsoup.connect("https://api.verbix.com/conjugator/iv1/ab8e7bb5-9ac6-11e7-ab6a-00089be4dcbc/1/21/121/misslyckas").ignoreContentType(true).execute().body();

Then you'll have to use a JSON parsing library, for example json-simple, to obtain the HTML fragment, which you can then parse with Jsoup:

String json = Jsoup.connect(
    "https://api.verbix.com/conjugator/iv1/ab8e7bb5-9ac6-11e7-ab6a-00089be4dcbc/1/21/121/misslyckas")
    .ignoreContentType(true).execute().body();
System.out.println(json);
JSONObject jsonObject = (JSONObject) JSONValue.parse(json);
String htmlFragmentObtainedFromJson = (String) ((JSONObject) jsonObject.get("p1")).get("html");
Document document = Jsoup.parse(htmlFragmentObtainedFromJson);
System.out.println(document);

Now you can try your initial approach of using selectors to get what you want from the document object.
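As a rough sketch of that last step, the snippet below runs a class selector over a parsed fragment and collects the text of each table row. The fragment in SAMPLE_FRAGMENT is made up for illustration and only assumes what the question states (tables with class verbtense and headers Present and Past); the real markup returned by the API may be laid out differently. The class and method names are hypothetical too.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class VerbTenseSketch {

    // Hypothetical stand-in for the "html" value taken from the JSON response.
    // The actual fragment served by the site may be structured differently.
    static final String SAMPLE_FRAGMENT =
        "<div><h3>Indicative</h3>"
        + "<table class=\"verbtense\"><tr><th>Present</th></tr>"
        + "<tr><td>jag</td><td>misslyckas</td></tr></table>"
        + "<table class=\"verbtense\"><tr><th>Past</th></tr>"
        + "<tr><td>jag</td><td>misslyckades</td></tr></table>"
        + "</div>";

    // Returns the text of every row found in tables carrying the class "verbtense".
    static List<String> verbTenseRows(String htmlFragment) {
        Document document = Jsoup.parse(htmlFragment);
        List<String> rows = new ArrayList<>();
        for (Element table : document.select("table.verbtense")) {
            for (Element row : table.select("tr")) {
                rows.add(row.text());
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        verbTenseRows(SAMPLE_FRAGMENT).forEach(System.out::println);
    }
}
```

The selector table.verbtense matches a table that has verbtense among its classes, whereas table[class=verbtense] only matches when the class attribute is exactly that value, so the class selector is the safer choice if the tables turn out to carry more than one class.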
