Why is my Jsoup code not returning the correct elements?

Problem description

I am working on an app in Android Studio and am having some trouble web-scraping with JSoup. I have successfully connected to the webpage and returned some basic elements to test the library, but now I cannot actually get the elements I need for my app.

I am trying to get a number of elements with the "data-at" attribute. The weird thing is, a few elements with the "data-at" attribute are returned, but not the ones I am looking for. For whatever reason my code is not extracting all of the elements that share the "data-at" attribute on the web page.

This is the URL of the webpage I am scraping: https://express.liatoyotaofcolonie.com/inventory?f=dealer.name%3ALia%20Toyota%20of%20Colonie&f=submodel%3ACamry&f=trim%3ALE&f=year%3A2020

The method containing the web-scraping code:

@Override
protected String doInBackground(Void... params) {
    String title = "";
    Document doc;
    Log.d(TAG, queryString.toString());
    try {
        // Jsoup downloads and parses the HTML as served; it does not execute JavaScript.
        doc = Jsoup.connect(queryString.toString()).get();
        // Select every element that carries a data-at attribute.
        Elements content = doc.select("[data-at]");
        for (Element e : content) {
            Log.d(TAG, e.text());
        }
    } catch (IOException e) {
        Log.e(TAG, e.toString());
    }
    return title;
}

The results in Logcat

The element I want to retrieve

One of the elements that is actually being retrieved

Solution

This is because some of the content, including the elements you are looking for, is created asynchronously by JavaScript and is not present in the initial DOM.

When you view the source of the page you will notice that there are only 17 data-at occurrences, while running document.querySelectorAll("[data-at]") in the browser console returns 29 nodes.
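
A quick way to reproduce the "view source" side of that comparison is to count the attribute in the raw HTML string, before any script has run. The following is only a minimal sketch using the JDK's regex classes: the class name, the method name and the sample markup are made up for illustration, and a regex is a rough check, not a real HTML parser.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Jsoup: counts data-at attributes in raw HTML text.
final class StaticAttributeCount {
    // Matches the start of a tag up to a data-at attribute, e.g. <div data-at=
    private static final Pattern DATA_AT = Pattern.compile("<[^>]*\\sdata-at\\s*=");

    static int countDataAt(String rawHtml) {
        Matcher m = DATA_AT.matcher(rawHtml);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Made-up sample page: two static elements plus a script that would
        // add a third data-at element only when executed in a browser.
        String rawHtml = "<html><body>"
                + "<h1 data-at=\"title\">Inventory</h1>"
                + "<div data-at=\"filters\"></div>"
                + "<script>var s = document.createElement('span');"
                + "s.setAttribute('data-at', 'price'); document.body.appendChild(s);</script>"
                + "</body></html>";
        // Only the two elements present in the served markup are counted.
        System.out.println(countDataAt(rawHtml));
    }
}
```

If the element you need does not show up in a count like this (or in the browser's "view source"), it is being added later by a script.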

What you get from Jsoup is the static content of the page (the initial DOM). You won't be able to fetch dynamically created content, because Jsoup does not run the JavaScript that creates it.

To overcome this, you will have to either fetch and parse the required resources manually (e.g. trace which AJAX calls the browser makes) or use a headless browser setup. Selenium + headless Chrome should be enough.
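
For the first option, once you have found the request in the browser's network tab, you can call that endpoint directly and parse the response yourself. Below is a rough sketch using only HttpURLConnection, which is available both on Android and in the JDK. The class and method names are illustrative, and the endpoint must be whatever URL you actually observe; nothing here is the site's real API. On Android this would still have to run off the main thread, as in doInBackground.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for calling an endpoint discovered in the network tab.
final class AjaxFetchSketch {
    // Reads an entire stream into a UTF-8 string.
    static String readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    // Sends a GET request and returns the response body as text.
    static String fetchBody(String endpoint) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(endpoint).openConnection();
        conn.setRequestMethod("GET");
        conn.setRequestProperty("Accept", "application/json");
        conn.setConnectTimeout(10_000);
        conn.setReadTimeout(10_000);
        try (InputStream in = conn.getInputStream()) {
            return readAll(in);
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: AjaxFetchSketch <endpoint-url-from-network-tab>");
            return;
        }
        // The caller supplies the endpoint; no URL is assumed here.
        System.out.println(fetchBody(args[0]));
    }
}
```

The returned body (often JSON) can then be handed to whatever parser your app already uses. Some endpoints also expect specific headers or cookies, which you would need to copy from the traced request.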

The latter option will allow you to scrape any possible web application, including SPAs, which is not possible using plain Jsoup.
