When is a webpage considered to be "loaded", in the presence of JS etc.

Question

Information: I have no knowledge of JavaScript. None.

I'm curious whether there is any way to determine when a webpage is completely loaded. Say I have a crawler that uses WebKit to render pages (and WebKit's JS engine to run any JS and finish processing the DOM, etc.). Is there any way to know when a webpage is 'done' loading? What I consider to be done:

1) All scripts have finished executing.
2) No AJAX calls are pending.
3) The DOM is completely processed and loaded based on currently available information.

For a more concrete hypothetical: from looking at the source of a few sites, I see that they load ads using a script tag that injects content into the DOM and issues AJAX calls to load and populate the ads. How can one determine when all of this is done?

(Replace the example with anything asynchronous, I guess. I just couldn't think of anything more universal than the above.)

By "detect", I mean in any manner possible. For instance, injecting a bit of JS code into the page that writes something to the page to let me know things are done. Or, with QtWebKit, JS can call into C++ (I believe), so a JS snippet could call a C++ function to let it know when the page was 'loaded'. Whatever works, in short.

The current 'naive' implementation I have just sits and waits for a few seconds after loading a page. It's stupid.

Please be as detailed as possible, and feel free to say 'read this first' if I need more background before I can understand the answer.

Thanks a lot!

Recommended answer

It is, in general, impossible to say whether a page that contains asynchronous, script-driven content is truly done loading. Aside from the fundamental issue of the halting problem, scripts or plugins can register for periodic timer events and continue modifying or adding to the page indefinitely.
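To make the periodic-timer point concrete, here is a minimal sketch in Python of a toy event loop. It is not how WebKit schedules timers; `EventLoop`, `set_timeout`, `run_until_idle`, `one_shot_page` and `ticker_page` are hypothetical names invented for this illustration. A page whose script schedules one deferred action eventually drains the queue, while a page that re-arms its own timer never does, so "wait until nothing is pending" needs an arbitrary cap.

```python
import heapq
from typing import Callable, List, Tuple


class EventLoop:
    """Toy timer queue (an illustration only, not a browser's real scheduler)."""

    def __init__(self) -> None:
        self._queue: List[Tuple[float, int, Callable[[], None]]] = []
        self._counter = 0  # tie-breaker so callbacks are never compared
        self.now = 0.0

    def set_timeout(self, delay: float, callback: Callable[[], None]) -> None:
        heapq.heappush(self._queue, (self.now + delay, self._counter, callback))
        self._counter += 1

    def run_until_idle(self, max_events: int) -> bool:
        """Run timers in order; True if the queue drained, False if the cap was hit."""
        for _ in range(max_events):
            if not self._queue:
                return True
            due, _order, callback = heapq.heappop(self._queue)
            self.now = due
            callback()
        return not self._queue


def one_shot_page(loop: EventLoop, dom: List[str]) -> None:
    # Hypothetical page: a script injects one ad shortly after load.
    loop.set_timeout(0.2, lambda: dom.append("ad"))


def ticker_page(loop: EventLoop, dom: List[str]) -> None:
    # Hypothetical page: a script updates the page every second, forever.
    def tick() -> None:
        dom.append("tick")
        loop.set_timeout(1.0, tick)  # re-arm: the queue is never empty

    loop.set_timeout(1.0, tick)


if __name__ == "__main__":
    for page in (one_shot_page, ticker_page):
        loop, dom = EventLoop(), []
        page(loop, dom)
        drained = loop.run_until_idle(max_events=100)
        print(page.__name__, "drained:", drained, "DOM changes:", len(dom))
```

With the cap of 100 events, the one-shot page drains after a single change, whereas the ticker page is cut off by the cap rather than ever finishing on its own.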

The approach I've usually seen is to consider a page done loading when the entire DOM has been loaded, the resources (images, stylesheets, scripts, etc.) referenced directly from that DOM have been loaded, and all script code has been read and executed through once. Text emitted via document.write() is treated for this purpose as if it were directly included in the source HTML. If you're using QtWebKit, I believe this is the behavior you will see if you connect to the signal QWebPage::loadFinished(bool). (You can get the containing QWebPage from a QWebFrame using the accessor page().)

Deferred actions set up by the script code, whether by timers, events waiting for other resources to finish loading, or what have you, are not counted. Media players and other plugins may complicate things further, because each media type or even each player may have a different standard for what constitutes "loaded".
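Because the load-finished notification ignores deferred work, a crawler that cares about it has to fall back on a heuristic. One common compromise, sketched below in Python, is to wait after the load notification until there are no outstanding requests and nothing has changed for a short quiet period, and to give up at a hard deadline. This is an assumption-laden sketch, not part of QtWebKit: `QuiescenceTracker` and its methods are hypothetical names, the thresholds are arbitrary, and how the events get reported (for example from injected JS or from network hooks in the embedding application) is left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QuiescenceTracker:
    """Heuristic 'settled enough to scrape' detector (hypothetical helper)."""

    quiet_period: float = 0.5    # seconds without activity before we call it settled
    hard_timeout: float = 10.0   # seconds after load at which we stop waiting
    load_finished_at: Optional[float] = None
    last_activity_at: float = 0.0
    pending_requests: int = 0

    def on_load_finished(self, now: float) -> None:
        # Corresponds to the conventional "page loaded" notification.
        self.load_finished_at = now
        self.last_activity_at = now

    def on_request_started(self, now: float) -> None:
        self.pending_requests += 1
        self.last_activity_at = now

    def on_request_finished(self, now: float) -> None:
        self.pending_requests = max(0, self.pending_requests - 1)
        self.last_activity_at = now

    def on_dom_mutated(self, now: float) -> None:
        self.last_activity_at = now

    def status(self, now: float) -> str:
        if self.load_finished_at is None:
            return "loading"
        quiet = now - self.last_activity_at >= self.quiet_period
        if self.pending_requests == 0 and quiet:
            return "settled"
        if now - self.load_finished_at >= self.hard_timeout:
            return "timed_out"  # e.g. a page that keeps changing forever
        return "busy"


if __name__ == "__main__":
    tracker = QuiescenceTracker()
    tracker.on_load_finished(now=1.0)
    tracker.on_request_started(now=1.1)   # a script issues an AJAX call
    print(tracker.status(now=2.0))        # busy: a request is still pending
    tracker.on_request_finished(now=2.2)
    tracker.on_dom_mutated(now=2.3)       # the response is injected into the DOM
    print(tracker.status(now=2.5))        # busy: only 0.2 s of quiet so far
    print(tracker.status(now=3.0))        # settled
```

The timestamps are passed in explicitly so the logic can be checked without real waiting. "Settled" here only means "quiet for a while"; a script may still act later, which is why the result is a heuristic and not a guarantee.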

A number of recent JavaScript libraries exploit this behavior to improve perceived page load times: they load an incomplete page containing just the first screen's worth of content plus some script, and do not actually begin loading images and content "below the fold" until the first screenful or so has finished loading and rendering. It's not very friendly to automated tools, crawlers, or those who consider JavaScript a privilege to be earned by trusted sites, though.
