Scraping result is different from inspected DOM element
Problem description
I want to parse a list of prices from a web page using Selenium WebDriver in Python. To do that, I tried to fetch all the DOM elements with this code:
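from selenium import webdriver

url = 'https://www.google.com/flights/explore/#explore;f=BDO;t=r-Asia-0x88d9b427c383bc81%253A0xb947211a2643e5ac;li=0;lx=2;d=2018-01-09'
driver = webdriver.Chrome()
driver.get(url)
# Printed right after get(): this is the markup as it stands at this moment
print(driver.page_source)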
The problem is that what I get from page_source is different from what I see in the inspected element:
<div class="CTPFVNB-f-a">
<div class="CTPFVNB-f-c"></div>
<div class="CTPFVNB-f-d" elt="toolbelt"></div>
<div class="CTPFVNB-f-e" elt="result">Here is the difference</div>
</div>
The difference is inside the element with class CTPFVNB-f-e. In the inspected DOM, this tag holds all the prices that I want to fetch, but in the output of page_source this part is missing.
Could anyone tell me what is wrong with my code? Or do I need further steps to parse the list of prices?
Recommended answer
JavaScript modifies the page after it loads. Because you print the page source immediately after opening the page, you get the initial HTML, before the JavaScript has run.
You can do any one of the following:
- Add a delay: call time.sleep(x), where x is the number of seconds to wait; adjust it to your requirements. (NOT recommended)
- Implicit wait: call driver.implicitly_wait(x), where x is again a number of seconds.
- Explicit wait: wait for the HTML element to appear and then get the page source. To learn how to do this, refer to this link. (HIGHLY recommended)
Using an explicit wait is the better option here because it waits only as long as the element needs to become visible, so it causes no unnecessary delay. And if the page loads more slowly than expected, you won't get the desired output with an implicit wait.
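As a rough illustration of what an explicit wait does, here is a minimal hand-rolled polling sketch. It assumes the prices show up as text inside the element carrying elt="result" (the attribute from the question's markup). The helper name wait_for_page_source, the timeout, and the poll interval are made up for this example. In real code, Selenium's own WebDriverWait together with expected_conditions covers the same ground and is probably the better choice; the loop below only shows the idea.

```python
import time


def wait_for_page_source(driver, css_selector, timeout=15.0, poll=0.5):
    """Return driver.page_source once an element matching css_selector has text.

    `driver` is any object exposing Selenium's find_elements(by, value) and
    page_source, for example webdriver.Chrome(). Hypothetical helper, not
    part of Selenium itself.
    """
    deadline = time.monotonic() + timeout
    while True:
        # "css selector" is the string value behind selenium's By.CSS_SELECTOR
        elements = driver.find_elements("css selector", css_selector)
        # Treat the content as rendered as soon as any match has visible text
        if any(el.text.strip() for el in elements):
            return driver.page_source
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"no content in {css_selector!r} after {timeout} seconds"
            )
        time.sleep(poll)
```

With the driver from the question, a call might look like wait_for_page_source(driver, "div[elt='result']") placed right after driver.get(url). The function returns as soon as the element has content, so it waits no longer than needed, and it raises TimeoutError instead of silently returning the unrendered source.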