Selenium page source differs from what right-click shows in the browser

Question

I am having a problem parsing a web page, because I get a different page source when I do:

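import codecs
from pyvirtualdisplay import Display
from selenium import webdriver

# Run Firefox inside a virtual (headless) X display
display = Display(visible=False, size=(800, 600), backend='xvfb')
display.start()
driver = webdriver.Firefox()
url = "http://www.aaa.com"
driver.get(url)
# Write the serialized DOM to disk as UTF-8
with codecs.open('page.html', 'w', 'utf-8') as f:
    f.write(driver.page_source)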

When I open the file to see the actual text, it is different from what I get with a right click in the browser.

For example, some href values become lower case. And some tags in the page source, such as:

<table class="list" border="0" id="list_id">

become

<table border="0" id="list_id" class="list">

I am pretty sure I am requesting the same URL...

Recommended answer

There are two major issues at play in getting the source of a web page like you are doing.

  1. Although we describe web pages using HTML, browsers don't work with HTML directly. They convert the HTML to an internal representation called a DOM tree. Both driver.page_source and saving the source of a page to disk transform this DOM tree back to HTML, in a process called serialization. Two serializers, or a single serializer used with two different configurations, can serialize the same DOM tree differently. You've encountered one such case with:

<table class="list" border="0" id="list_id">

<table border="0" id="list_id" class="list">

In the two instances above, the order of attributes is different. However, this does not matter because attributes are not ordered in HTML. (Elements, and the tags that mark the start and end of elements, are ordered, so <a><b> is not the same as <b><a>.) Other differences can arise from the way serializers handle spacing. Names can also differ in capitalization: <TABLE> and <table> are equivalent, because HTML element and attribute names are not case-sensitive (XHTML is case-sensitive).

There is no guarantee that Selenium and Firefox's save menu are going to use the exact same serializer with the exact same configuration. So there may be differences between what you get from the two methods.
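To check whether two saved copies differ only in serialization details, one option is to compare them after normalization rather than as raw strings. The following is a minimal sketch using Python's standard html.parser module; the names StartTagCollector, start_tags, saved_from_browser, and from_page_source are made up for this illustration, and it only compares start tags, not text or whitespace.

```python
from html.parser import HTMLParser


class StartTagCollector(HTMLParser):
    """Collect start tags in document order, with attributes sorted by name."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser already lower-cases tag and attribute names;
        # sorting by attribute name removes the dependence on attribute order.
        normalized = tuple(sorted(attrs, key=lambda pair: pair[0]))
        self.tags.append((tag, normalized))


def start_tags(html):
    """Return a normalized list of (tag, attributes) for every start tag."""
    collector = StartTagCollector()
    collector.feed(html)
    collector.close()
    return collector.tags


# Hypothetical snippets standing in for the two saved copies of the page.
saved_from_browser = '<TABLE class="list" border="0" id="list_id">'
from_page_source = '<table border="0" id="list_id" class="list">'

print(saved_from_browser == from_page_source)  # False: the raw text differs
print(start_tags(saved_from_browser) == start_tags(from_page_source))  # True
```

Element order is still significant in this comparison, and attribute values are compared as-is, so a genuinely different href value would still show up as a difference.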

  2. Another thing that may cause you trouble is Ajax. It is common nowadays for a web page not to contain initially all the elements that it needs. Some of these elements are loaded shortly after the initial page has finished loading. If you read driver.page_source after the page has initially loaded but before the Ajax requests have had a chance to load the additional elements, and then you manually save the page using Firefox's menu, chances are that some differences will occur because driver.page_source misses the elements loaded through Ajax.
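One way to reduce the Ajax effect is to wait until the page stops changing before reading the source. The helper below is a rough sketch, not part of Selenium: wait_for_stable_source and its parameters are hypothetical names, and it simply polls any callable that returns the current source until several consecutive reads are identical.

```python
import time


def wait_for_stable_source(get_source, interval=0.5, stable_checks=3, timeout=30.0):
    """Poll get_source() until it returns the same text stable_checks times in a row.

    get_source is any zero-argument callable, for example
    lambda: driver.page_source. Raises TimeoutError if the text keeps changing.
    """
    deadline = time.monotonic() + timeout
    previous = get_source()
    unchanged = 1  # number of consecutive identical reads seen so far
    while unchanged < stable_checks:
        if time.monotonic() > deadline:
            raise TimeoutError("page source did not settle before the timeout")
        time.sleep(interval)
        current = get_source()
        if current == previous:
            unchanged += 1
        else:
            # The page changed (e.g. Ajax content arrived); start counting again.
            previous, unchanged = current, 1
    return previous
```

With the code from the question, this could be called as wait_for_stable_source(lambda: driver.page_source) before writing the file. A page that updates itself continuously will never settle and will hit the timeout; when you know which element the Ajax call adds, waiting for that specific element with Selenium's WebDriverWait is usually more precise.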
