Jsoup - hidden div class?


Problem description

I'm trying to scrape a div by its class, but everything I have tried so far has failed :(

I'm trying to scrape the element(s):

<a href="http://www.bellator.com/events/d306b5/bellator-newcastle-pitbull-vs-scope"><div class="s_buttons_button s_buttons_buttonAlt s_buttons_buttonSlashBack">More info</div></a>

from the website: http://www.bellator.com/events

I tried accessing the list of elements by doing

Elements elements = document.select("div[class=s_container] > li");

but that didn't return anything.

Then I tried accessing just the parent with

Elements elements = document.select("div[class=s_container]");

and that returned two divs with the class name "s_container", neither of which is the one I need :<

Then I tried accessing that one's parent with

Elements elements = document.select("div[class=ent_m152_bellator module ent_m152_bellator_V1_1_0 ent_m152]");

And that didn't return anything.

I also tried

Elements elements = document.select("div[class=ent_m152_bellator]");

because I wasn't sure about the white spaces, but it didn't return anything either.

Then I tried accessing its parent by

Elements elements = document.select("div#t3_lc");

and that worked, but it returned an element containing

<div id="t3_lc"> 
<div class="triforce-module" id="t3_lc_promo1"></div> 
</div>

which is kinda weird, because I can't see that it has that child when I inspect the website in Chrome :S

Does anyone know what's going on? I feel kinda lost.

Solution

What you see in your web browser is not what Jsoup sees. Disable JavaScript and refresh the page to get what Jsoup gets, or press CTRL+U ("Show source", not "Inspect"!) in your browser to see the original HTML document before JavaScript modifies it. Your browser's debugger shows the final document after those modifications, so it's not suitable for your needs.
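The same check can be done from code. The following is a minimal sketch, not part of the original answer: it swaps in the JDK's own java.net.http.HttpClient (Java 11+) to download the page as served, without running any JavaScript, and then looks for a marker such as a class name. The class and method names (RawHtmlCheck, fetchRawHtml, isInRawHtml) are made up for illustration, and the site may respond differently depending on request headers.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

class RawHtmlCheck {
    /** Downloads the response body as served, without executing any JavaScript. */
    static String fetchRawHtml(String url) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }

    /** True if the marker (for example a class name) occurs in the raw HTML. */
    static boolean isInRawHtml(String rawHtml, String marker) {
        return rawHtml.contains(marker);
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.out.println("usage: RawHtmlCheck <url> <marker>");
            return;
        }
        // Hypothetical usage: RawHtmlCheck http://www.bellator.com/events s_buttons_buttonSlashBack
        System.out.println(isInRawHtml(fetchRawHtml(args[0]), args[1]));
    }
}
```

If the marker is missing from the raw HTML but visible in the Inspector, the element was most likely added by JavaScript after the page loaded, and Jsoup will not find it either.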

It seems like the whole "UPCOMING EVENTS" section is loaded dynamically by JavaScript. What's more, this section is loaded asynchronously with AJAX. You can use your browser's debugger (Network tab) to see every request and response.

I found the request, but unfortunately all the data you need is returned as JSON, so you're going to need another library to parse it.

That's not the end of the bad news, because this case is more complicated. You could request the data directly: http://www.bellator.com/feeds/ent_m152_bellator/V1_1_0/d10a728c-547e-4a6f-b140-7eecb67cff6b but the URL seems random, and a few of these URLs (one per upcoming event?) are included inside the JavaScript code in the HTML.

My approach would be to get the URLs of these feeds with something like:


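        // needs java.util.ArrayList, java.util.List, org.jsoup.Jsoup,
        // org.jsoup.nodes.Element and org.jsoup.select.Elements
        List<String> feedUrls = new ArrayList<>();

        // select all the scripts
        Elements scripts = document.select("script");
        for (Element script : scripts) {
            // script contents are stored as data nodes, so read them with data();
            // text() returns an empty string for <script> elements
            String source = script.data();
            if (source.contains("http://www.bellator.com/feeds/")) {
                // here use regexp to get all URLs from source and add them to feedUrls

            }
        }

        for (String feedUrl : feedUrls) {
            // iterate over feed URLs, download each of them;
            // execute().body() returns the raw response text instead of parsing the JSON as HTML
            String json = Jsoup.connect(feedUrl).ignoreContentType(true).execute().body();
            // here use JSON parsing library to get the data you need

        }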
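
For the regexp step, a minimal sketch using only the JDK could look like the following. It is an illustration rather than a tested scraper for this site: the class name FeedUrlExtractor and the method extractFeedUrls are made up, and the pattern assumes the feed URLs appear unescaped in the script and end at a quote, whitespace, backslash, or angle bracket.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FeedUrlExtractor {
    // Assumption: URLs are unescaped and end at a quote, whitespace, backslash, or angle bracket.
    private static final Pattern FEED_URL =
            Pattern.compile("http://www\\.bellator\\.com/feeds/[^\"'\\s\\\\<>]+");

    /** Returns the distinct feed URLs found in the script source, in order of appearance. */
    static List<String> extractFeedUrls(String scriptSource) {
        Set<String> urls = new LinkedHashSet<>();
        Matcher matcher = FEED_URL.matcher(scriptSource);
        while (matcher.find()) {
            urls.add(matcher.group());
        }
        return new ArrayList<>(urls);
    }

    public static void main(String[] args) {
        // Hypothetical script content, made up for illustration.
        String script = "var feed = \"http://www.bellator.com/feeds/"
                + "ent_m152_bellator/V1_1_0/d10a728c-547e-4a6f-b140-7eecb67cff6b\";";
        System.out.println(extractFeedUrls(script));
    }
}
```

Inside the loop over the scripts you would then call feedUrls.addAll(FeedUrlExtractor.extractFeedUrls(...)) with the script's contents. If the page turns out to escape the slashes in its JavaScript (for example \/), the pattern would need adjusting.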

An ALTERNATIVE approach would be to stop using Jsoup because of its limitations and use Selenium WebDriver instead. It supports dynamic page modifications by JavaScript, so you'd get the HTML of the final result: exactly what you see in the web browser and Inspector.
