Web scraping with jsoup in Kotlin

Question

I am trying to scrape this website as part of a lesson to learn Kotlin and web scraping with jsoup.

What I am trying to scrape is the "Jackpot $1,000,000 est." value.

I wrote the code below after searching and following a couple of online tutorials, but it doesn't even print $1,000,000, which is the value it is meant to scrape.

Jsoup.connect("https://online.singaporepools.com/lottery/en/home")
    .get()
    .run {
        select("div.slab__text slab__text--highlight").forEachIndexed { i, element ->
            val titleAnchor = element.select("div")
            val title = titleAnchor.text()
            println("$i. $title")
        }
    }

My first thought is that the website may be filling in this value with JavaScript, which would explain why the code finds nothing.

How should I be going about scraping it?

Solution

I was able to scrape what you were looking for from this page on that same site.
Even if it's not what you want, the procedure may help someone in the future.

Here is how I did that:

  1. First I opened that page
  2. Then I opened the Chrome developer tools by pressing Ctrl+Shift+I, or
    by right-clicking somewhere on the page and selecting Inspect, or
    by clicking ⋮ ➜ More tools ➜ Developer tools
  3. Next I selected the Network tab
  4. And finally I refreshed the page with F5 or with the refresh button ⟳

A list of requests starts to appear (the network log), and after a few seconds all of them will have completed. Here, we want to find and inspect a request whose Type is xhr. We can filter requests by clicking the filter icon and then selecting the desired type.

To inspect a request, click on its name (first column from the left).

Clicking one of the XHR requests and then selecting the Response tab shows that the response contains exactly what we are looking for. It is also HTML, so jsoup can parse it.

Here is that response (if you want to copy or manipulate it):

<div style='vertical-align:top;'>
  <div>
    <div style='float:left; width:120px; font-weight:bold;'>
      Next Jackpot
    </div>
    <span style='color:#EC243D; font-weight:bold'>$8,000,000 est</span>
  </div>
  <div>
    <div style='float:left; width:120px; font-weight:bold;'>
      Next Draw
    </div>
    <div class='toto-draw-date'>Mon, 15 Nov 2021 , 9.30pm</div>
  </div>
</div>
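
Because the response is plain HTML, the parsing step can be tried offline before making any network request. The following is a minimal sketch that assumes the response above has been pasted into a string; `NextDraw`, `savedResponse`, and `parseNextDraw` are names made up for this illustration and are not part of jsoup.

```kotlin
import org.jsoup.Jsoup

// Hypothetical holder for the two values of interest.
data class NextDraw(val jackpot: String, val drawDate: String)

// The response body as copied from the Response tab.
val savedResponse = """
    <div style='vertical-align:top;'>
      <div>
        <div style='float:left; width:120px; font-weight:bold;'>Next Jackpot</div>
        <span style='color:#EC243D; font-weight:bold'>$8,000,000 est</span>
      </div>
      <div>
        <div style='float:left; width:120px; font-weight:bold;'>Next Draw</div>
        <div class='toto-draw-date'>Mon, 15 Nov 2021 , 9.30pm</div>
      </div>
    </div>
""".trimIndent()

// Parses the fragment and extracts the jackpot amount and the draw date.
fun parseNextDraw(html: String): NextDraw {
    val doc = Jsoup.parse(html)
    // The jackpot is the text of the only <span>; drop the trailing " est".
    val jackpot = doc.selectFirst("span")?.text()?.removeSuffix(" est")
        ?: error("no <span> with the jackpot found")
    // The draw date is identified by its class rather than by position.
    val drawDate = doc.selectFirst("div.toto-draw-date")?.text()
        ?: error("no .toto-draw-date element found")
    return NextDraw(jackpot, drawDate)
}
```

Calling `parseNextDraw(savedResponse)` yields a `NextDraw` whose `jackpot` is `$8,000,000` and whose `drawDate` is `Mon, 15 Nov 2021 , 9.30pm`. Selecting the date by its `toto-draw-date` class instead of by child index should keep working if the inner elements are reordered, although how the site's markup may change is only a guess.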

By selecting the Headers tab (to the left of the Response tab), we see that the Request URL is https://www.singaporepools.com.sg/DataFileArchive/Lottery/Output/toto_next_draw_estimate_en.html?v=2021y11m14d21h0m, the Request Method is GET, and the Content-Type is text/html.
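
The same details can be checked from code rather than by eye. The sketch below is one possible way to do it, assuming jsoup's `Connection.execute()` is used instead of `get()` so that the response headers are available before parsing. `isHtml` and `fetchHtml` are hypothetical helper names, and the 10-second timeout is an arbitrary choice.

```kotlin
import org.jsoup.Connection
import org.jsoup.Jsoup
import org.jsoup.nodes.Document

// True when a Content-Type header value denotes HTML, ignoring any charset suffix.
fun isHtml(contentType: String?): Boolean =
    contentType?.substringBefore(';')?.trim()?.equals("text/html", ignoreCase = true) == true

// Sends a GET request and parses the body only if the server reports HTML.
fun fetchHtml(url: String): Document {
    val response = Jsoup.connect(url)
        .userAgent("Mozilla")
        .method(Connection.Method.GET)
        .timeout(10_000) // milliseconds; an arbitrary choice for this sketch
        .execute()
    check(isHtml(response.contentType())) {
        "Expected text/html but got ${response.contentType()}"
    }
    return response.parse()
}
```

Passing the Request URL found in the Headers tab to `fetchHtml` returns the parsed document. By default jsoup already throws an exception for HTTP error statuses, so this helper only adds the content-type check on top of that.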

So, with the URL and the HTTP method we found, here is the code to scrape that HTML:

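import org.jsoup.Jsoup

// Fetch the HTML fragment that the page loads through the XHR request.
val document = Jsoup
    .connect("https://www.singaporepools.com.sg/DataFileArchive/Lottery/Output/toto_next_draw_estimate_en.html?v=2021y11m14d21h0m")
    .userAgent("Mozilla")
    .get()

// jsoup wraps the fragment in <html><body>; the body's only child is the outer <div>.
val targetElement = document
    .body()
    .children()
    .single()

// The first inner <div> holds the "Next Jackpot" label and the amount.
val phrase = targetElement.child(0).text()
val prize = targetElement.select("span").text().removeSuffix(" est")

println(phrase) // Next Jackpot $8,000,000 est
println(prize)  // $8,000,000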
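
As a side note, the selector in the question, `div.slab__text slab__text--highlight`, contains a space, which CSS selectors treat as a descendant combinator: it looks for a `<slab__text--highlight>` element inside a `div.slab__text`. To match a single element that carries both classes, the classes are chained with dots. The sketch below illustrates the difference on hypothetical markup (the real structure of the page in the question is not shown here, and `sampleHtml` and `textsMatching` are made-up names).

```kotlin
import org.jsoup.Jsoup

// Hypothetical markup; the actual page in the question may be structured differently.
val sampleHtml = """
    <div class="slab__text slab__text--highlight">$1,000,000 est</div>
    <div class="slab__text">Other text</div>
""".trimIndent()

// Returns the text of every element matched by the given CSS selector.
fun textsMatching(html: String, cssQuery: String): List<String> =
    Jsoup.parse(html).select(cssQuery).map { it.text() }
```

With this markup, `textsMatching(sampleHtml, "div.slab__text slab__text--highlight")` returns an empty list, while `textsMatching(sampleHtml, "div.slab__text.slab__text--highlight")` returns a list containing only `$1,000,000 est`. Correcting the selector alone would still find nothing if the value is delivered by a separate request after the page loads, which is what the XHR request above suggests.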
