Web scraping of a modal window (dialog box) using jsoup


Question

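I am working on a project in which I have to extract data from a website. The project is written in Java and the website is built with JavaScript. I am using jsoup to extract the data, but the page contains some modal windows (dialog boxes, pop-up windows). Is it possible to extract the data in those modal windows using jsoup? If so, how can I do it? Links are welcome. If not, what are the best alternatives?

Thanks for your help. I really appreciate it.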

Solution

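I assume that the modal is generated by JavaScript.
jsoup is just a parser. This means that it makes an HTTP request (GET or POST, whichever you tell it to make) and the server (the website) responds with the initial HTML. By initial, I mean the HTML before any JavaScript is executed.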

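JavaScript can generate HTML (like the modal in question), but that HTML is not visible to jsoup, because a parser can only read markup; it cannot execute code. A browser can generate the modal because it includes a JavaScript engine that parses and executes JavaScript.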
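
To illustrate the point, here is a minimal sketch, assuming jsoup is on the classpath. The page markup and the ids `content` and `modal` are made up for the example. The script would add the modal in a browser, but jsoup only parses the markup and never runs it:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class StaticVersusDynamic {
    // Hypothetical page: the #modal element exists only after a browser runs the script.
    static final String HTML =
        "<html><body>"
        + "<div id=\"content\">Static text</div>"
        + "<script>"
        + "var m = document.createElement('div');"
        + "m.id = 'modal';"
        + "m.textContent = 'Modal text';"
        + "document.body.appendChild(m);"
        + "</script>"
        + "</body></html>";

    public static void main(String[] args) {
        // Parse the markup exactly as received; the script body is kept as inert data.
        Document doc = Jsoup.parse(HTML);

        // The static element is found.
        System.out.println(doc.select("#content").text());   // prints "Static text"

        // The script-generated modal is not part of the parsed document.
        System.out.println(doc.select("#modal").isEmpty());  // prints "true"
    }
}
```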

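When you visit a web page, you cannot tell at a glance what is dynamic (generated by JavaScript) and what is static (sent by the server as is).
A quick way to check which is which (jsoup sees only the static part) is the following: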


  1. Visit the web page you want to parse (with Chrome if possible; Firefox should work too).
  2. Press Ctrl + U. This opens the page source in a new tab.

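The new tab contains a mix of HTML, CSS and JavaScript. This is what the server sent to the browser, and it is also what jsoup sees.
If the modal is in there, great: jsoup can see it. If not, you have to use a library that acts as a headless browser.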
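
The same check can be roughed out in code. The sketch below uses only the JDK's `java.net.http` client (Java 11+) to download the raw source and looks for a marker string. The class name, method names and the example marker `id="modal"` are placeholders for illustration, not part of any library:

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

class StaticSourceCheck {
    // Downloads the raw page source, roughly what Ctrl + U shows in the browser.
    static String fetchSource(String url) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }

    // Crude check: does the marker (for example an id attribute) occur in the source?
    static boolean appearsInSource(String source, String marker) {
        return source != null && marker != null
            && !marker.isEmpty() && source.contains(marker);
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.out.println("usage: StaticSourceCheck <url> <marker>");
            return;
        }
        String source = fetchSource(args[0]);
        System.out.println(appearsInSource(source, args[1])
            ? "Marker found in the static source; jsoup should be able to see it."
            : "Marker not found; the content is probably generated by JavaScript.");
    }
}
```

A substring match is only a first hint: the marker may also occur inside a script string, so it is worth confirming with a real selector (for example jsoup's `select`) before relying on the result.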



A headless browser is essentially a browser without a graphical interface. It can parse and execute JavaScript, so it "sees" what a normal browser sees.

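The most commonly used library is Selenium WebDriver. Be careful: Selenium is a testing framework with many parts, and the part you need is WebDriver.
There are plenty of examples available with ready-made code to get you started.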
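
As a starting point, here is a minimal sketch, assuming Selenium 4 with Chrome and a matching driver available on the machine. The URL and the selectors `#open-modal` and `#modal` are placeholders; replace them with whatever the real page uses:

```java
import java.time.Duration;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;
import org.openqa.selenium.support.ui.ExpectedConditions;
import org.openqa.selenium.support.ui.WebDriverWait;

class ModalScraper {
    public static void main(String[] args) {
        // Placeholder URL, for illustration only.
        String url = "https://example.com/page-with-modal";

        // Run Chrome without a visible window.
        ChromeOptions options = new ChromeOptions();
        options.addArguments("--headless=new");
        WebDriver driver = new ChromeDriver(options);
        try {
            driver.get(url);

            // Hypothetical trigger: click the element that opens the modal.
            driver.findElement(By.cssSelector("#open-modal")).click();

            // Wait up to 10 seconds for the script-generated modal to become visible.
            WebElement modal = new WebDriverWait(driver, Duration.ofSeconds(10))
                .until(ExpectedConditions.visibilityOfElementLocated(By.cssSelector("#modal")));

            System.out.println(modal.getText());
        } finally {
            // Always close the browser, even if the lookup fails.
            driver.quit();
        }
    }
}
```

If you prefer to keep your existing jsoup selectors, one option is to pass `driver.getPageSource()` to `Jsoup.parse` once the modal has rendered, so that jsoup parses the HTML after JavaScript has run.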

