Why is my Jsoup Code not Returning the Correct Elements?


Problem description

I am working on an app in Android Studio and am having some trouble web-scraping with JSoup. I have successfully connected to the webpage and returned some basic elements to test the library, but now I cannot actually get the elements I need for my app.

I am trying to get a number of elements with the "data-at" attribute. The weird thing is, a few elements with the "data-at" attribute are returned, but not the ones I am looking for. For whatever reason my code is not extracting all of the elements that share the "data-at" attribute on the web page.

This is the URL of the webpage I am scraping: https://express.liatoyotaofcolonie.com/inventory?f=dealer.name%3ALia%20Toyota%20of%20Colonie&f=submodel%3ACamry&f=trim%3ALE&f=year%3A2020

The method containing the web-scraping code:

@Override
protected String doInBackground(Void... params) {
    String title = "";
    Log.d(TAG, queryString.toString());
    try {
        // Download the page and parse the returned HTML.
        Document doc = Jsoup.connect(queryString.toString()).get();
        // Select every element that carries a data-at attribute.
        Elements content = doc.select("[data-at]");
        for (Element e : content) {
            Log.d(TAG, e.text());
        }
    } catch (IOException e) {
        Log.e(TAG, e.toString());
    }
    return title;
}

[Image: the results in Logcat]

[Image: the element I want to retrieve]

[Image: one of the elements that is actually being retrieved]

Solution

This is because some of the content, including the elements you are looking for, is created asynchronously by JavaScript and is not present in the initial DOM.

When you view the source of the page you will notice that there are only 17 data-at occurrences, whereas running document.querySelectorAll("[data-at]") in the browser console returns 29 nodes.

What you get with Jsoup is the static content of the page (the initial DOM). You won't be able to fetch dynamically created content, because Jsoup does not run the JavaScript that creates it.
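One rough way to check how much of the markup is static is to count the attribute in the raw HTML text and compare that number with what the browser console reports. The sketch below uses only the Java standard library. The class name, the method name, and the sample HTML are made up for illustration, and the regular expression is a crude textual check, not a real HTML parser.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StaticSourceCheck {

    // Counts textual occurrences of `attributeName=` (preceded by whitespace)
    // in the raw HTML. Nothing is executed, so the result reflects only the
    // markup the server sent, which is also all that Jsoup gets to see.
    public static int countAttribute(String rawHtml, String attributeName) {
        Pattern pattern = Pattern.compile("\\s" + Pattern.quote(attributeName) + "\\s*=");
        Matcher matcher = pattern.matcher(rawHtml);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Hypothetical sample: one element is present in the markup, a second
        // one would only be created when the script runs in a browser.
        String sample = "<div data-at=\"header\">Inventory</div>"
                + "<script>el.setAttribute('data-at', 'price');</script>";
        System.out.println(countAttribute(sample, "data-at")); // prints 1
    }
}
```

If the count for the real page source is lower than the number the browser reports, the missing elements are being added by scripts after the page loads.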

To overcome this, you will have to either fetch and parse the required resources manually (e.g. trace which AJAX calls the browser makes and request those yourself) or use a headless browser setup. Selenium with headless Chrome should be enough.

The latter option will allow you to scrape any web application, including SPAs, which is not possible with plain Jsoup.
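The first option (requesting the data the page loads via AJAX) could look roughly like the sketch below. It is only an outline under assumptions: the class and method names are hypothetical, the real endpoint URL has to be found in the Network tab of the browser's developer tools, and the endpoint may require extra headers or parameters that are not shown here. It uses HttpURLConnection, which is available both in the JDK and on Android.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

class AjaxFetchSketch {

    // Reads the whole stream into a UTF-8 string.
    public static String readBody(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    // Requests the data endpoint directly instead of the HTML page.
    // endpointUrl is an assumption: copy it from the browser's Network tab.
    public static String fetch(String endpointUrl) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) new URL(endpointUrl).openConnection();
        conn.setRequestMethod("GET");
        conn.setRequestProperty("Accept", "application/json");
        try (InputStream in = conn.getInputStream()) {
            return readBody(in);
        } finally {
            conn.disconnect();
        }
    }
}
```

On Android, fetch would have to be called off the main thread, for example from doInBackground as in the question. The returned body is typically JSON that can be parsed directly, with no need for Jsoup.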
