Using Python requests.get to parse HTML code that does not load at once

Problem description

I am trying to write a Python script that will periodically check a website to see if an item is available. I have used requests.get, lxml.html, and xpath successfully in the past to automate website searches. In the case of this particular URL (http://www.anthropologie.com/anthro/product/4120200892474.jsp?cm_vc=SEARCH_RESULTS#/) and others on the same website, my code was not working.

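import requests
from lxml import html

# Download the product page and parse the returned HTML text
page = requests.get("http://www.anthropologie.com/anthro/product/4120200892474.jsp?cm_vc=SEARCH_RESULTS#/")
tree = html.fromstring(page.text)
# Look for the "sold out" div; xpath() returns a list of matching elements
html_element = tree.xpath(".//div[@class='product-soldout ng-scope']")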

At this point, html_element should be a list of elements (I think in this case only 1), but instead it is empty. I think this is because the website is not loading all at once, so when requests.get() goes out and grabs it, it's only grabbing the first part. So my questions are: 1. Am I correct in my assessment of the problem? 2. If so, is there a way to make requests.get() wait before returning the html, or perhaps another route entirely to get the whole page?

Thanks

Thanks for both responses. I used Selenium and got my script working.

Recommended answer

You are not correct in your assessment of the problem.

You can check the results and see that there's a </html> right near the end. That means you've got the whole page.
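That check can be sketched in a few lines. This is only an illustration: the helper name looks_complete and the 200-character window are assumptions, not part of requests or lxml.

```python
def looks_complete(html_text, window=200):
    """Return True if a closing </html> tag appears near the end of the text.

    `window` is an arbitrary choice for how many trailing characters to inspect.
    """
    tail = html_text.rstrip()[-window:].lower()
    return "</html>" in tail


# Hypothetical inputs standing in for page.text
full_page = "<html><body><p>hello</p></body></html>\n"
cut_off_page = "<html><body><p>hel"

print(looks_complete(full_page))     # a complete document
print(looks_complete(cut_off_page))  # a truncated document
```

In the script from the question, the same helper would be called as looks_complete(page.text).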

And requests.get() always grabs the whole page by default, so page.text holds the complete response body; if you want to stream it a bit at a time, you have to do so explicitly by passing stream=True.
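For comparison, here is a rough sketch of what explicit streaming looks like. It uses the stream=True argument and Response.iter_content() from requests; the function names collect_chunks and fetch_streamed, the chunk size, and the timeout are illustrative assumptions.

```python
def collect_chunks(response, chunk_size=8192):
    """Read a streamed response body piece by piece.

    Returns the reassembled body as bytes and the number of chunks read.
    """
    chunks = []
    for chunk in response.iter_content(chunk_size=chunk_size):
        if chunk:  # skip empty keep-alive chunks
            chunks.append(chunk)
    return b"".join(chunks), len(chunks)


def fetch_streamed(url, chunk_size=8192):
    """Download url in streaming mode (hypothetical helper, needs requests)."""
    import requests  # third-party; imported here so the sketch loads without it

    # stream=True defers downloading the body until it is iterated over
    with requests.get(url, stream=True, timeout=10) as response:
        response.raise_for_status()
        return collect_chunks(response, chunk_size)
```

Without stream=True none of this is needed: the body is downloaded in full before requests.get() returns, which is why the missing div cannot be explained by a partial download.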

Your problem is that the div you're looking for doesn't actually exist in the HTML; it's built dynamically by client-side JavaScript. You can see that by actually reading the HTML that's returned. So, unless you run that JavaScript, you don't have the information.
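One way to confirm this without reading the markup by eye is to list the class names present in the HTML that was actually returned. The sketch below uses only the standard library; the two sample strings are made up for illustration and are not the real page source. Note also that ng-scope, the second class in the XPath from the question, is one that AngularJS adds to elements at runtime, which is a hint that the div is created in the browser.

```python
from html.parser import HTMLParser


class ClassCollector(HTMLParser):
    """Collect every CSS class name used by any tag in a document."""

    def __init__(self):
        super().__init__()
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                self.classes.update(value.split())


def has_class(html_text, class_name):
    """Return True if any element in html_text carries class_name."""
    collector = ClassCollector()
    collector.feed(html_text)
    collector.close()
    return class_name in collector.classes


# Made-up stand-ins: what a server might send vs. what a browser builds
served = '<html><body><div ng-view></div><script src="app.js"></script></body></html>'
rendered = '<html><body><div class="product-soldout ng-scope">Sold out</div></body></html>'

print(has_class(served, "product-soldout"))    # absent from the served HTML
print(has_class(rendered, "product-soldout"))  # present after JavaScript runs
```

Calling has_class(page.text, "product-soldout") on the response from the question would show whether the server sent that div at all.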

There are a number of general solutions to that. For example:

  • Use selenium or similar to drive an actual browser to download the page.
  • Manually work out what the JavaScript code does and do equivalent work in Python.
  • Run a headless JavaScript interpreter against a DOM that you've built up.
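
The first option might look roughly like the sketch below. It is not the asker's actual script: the function names, the CSS selector div.product-soldout (derived from the XPath in the question), the Firefox driver, and the timeout values are all assumptions. The polling helper only relies on the driver's get() and find_elements() methods.

```python
import time

URL = "http://www.anthropologie.com/anthro/product/4120200892474.jsp?cm_vc=SEARCH_RESULTS#/"
SOLD_OUT_SELECTOR = "div.product-soldout"  # assumed from the question's XPath


def is_sold_out(driver, url, timeout=10.0, poll=0.5):
    """Load url in a real browser and wait for the sold-out div to appear.

    Returns True as soon as the div is found, or False once timeout
    seconds have passed without it showing up.
    """
    driver.get(url)
    deadline = time.monotonic() + timeout
    while True:
        # "css selector" is the string value of selenium's By.CSS_SELECTOR
        if driver.find_elements("css selector", SOLD_OUT_SELECTOR):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll)


def main():
    """Run the check in Firefox (requires selenium and a Firefox install)."""
    from selenium import webdriver  # third-party; imported lazily

    driver = webdriver.Firefox()
    try:
        print("sold out:", is_sold_out(driver, URL))
    finally:
        driver.quit()
```

Calling main() opens a browser window, lets the page's JavaScript run, and then looks for the div. Selenium also ships its own WebDriverWait helper, which could replace the manual polling loop used here.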
