Using Python requests.get to parse HTML code that does not load at once
Problem description
I am trying to write a Python script that will periodically check a website to see if an item is available. I have used requests.get, lxml.html, and xpath successfully in the past to automate website searches. In the case of this particular URL (http://www.anthropologie.com/anthro/product/4120200892474.jsp?cm_vc=SEARCH_RESULTS#/) and others on the same website, my code was not working.
import requests
from lxml import html
page = requests.get("http://www.anthropologie.com/anthro/product/4120200892474.jsp?cm_vc=SEARCH_RESULTS#/")
tree = html.fromstring(page.text)
html_element = tree.xpath(".//div[@class='product-soldout ng-scope']")
At this point, html_element should be a list of elements (I think in this case only one), but instead it is empty. I think this is because the website is not loading all at once, so when requests.get() goes out and grabs it, it's only grabbing the first part. So my questions are:
1: Am I correct in my assessment of the problem?
2: If so, is there a way to make requests.get() wait before returning the HTML, or perhaps another route entirely to get the whole page?
Thanks
Edit: Thanks to both responses. I used Selenium and got my script working.
You are not correct in your assessment of the problem.
You can check the results and see that there's a </html> right near the end. That means you've got the whole page.
And page.text (the .text attribute of the Response object) always grabs the whole page; if you want to stream it a bit at a time, you have to do so explicitly.
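For instance, explicit streaming with requests looks roughly like this (a sketch only; the function name `fetch_in_chunks` is made up here, not something from the question):

```python
import requests

def fetch_in_chunks(url, chunk_size=8192):
    """Download a response body a piece at a time.

    stream=True defers downloading the body, and iter_content()
    then yields it chunk by chunk instead of all at once.
    """
    with requests.get(url, stream=True) as resp:
        for chunk in resp.iter_content(chunk_size=chunk_size):
            yield chunk
```

By contrast, reading page.text, as the question's code does, pulls the entire body down in one go before you ever see it.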
Your problem is that the table doesn't actually exist in the HTML; it's built dynamically by client-side JavaScript. You can see that by actually reading the HTML that's returned. So, unless you run that JavaScript, you don't have the information.
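You can convince yourself that the XPath expression itself is fine by running it against HTML that actually contains the element. The markup below is a made-up stand-in for what the page looks like before and after the JavaScript has run:

```python
from lxml import html

# Stand-in for the page after client-side JavaScript has built the element.
rendered = "<html><body><div class='product-soldout ng-scope'>Sold out</div></body></html>"

# Stand-in for the raw HTML that requests.get() actually receives.
raw = "<html><body><div id='app'></div></body></html>"

xpath = ".//div[@class='product-soldout ng-scope']"
print(len(html.fromstring(rendered).xpath(xpath)))  # 1: element present
print(len(html.fromstring(raw).xpath(xpath)))       # 0: nothing to match
```

Same query both times; only the input HTML differs, which is exactly the situation the question ran into.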
There are a number of general solutions to that. For example:
- Use selenium or similar to drive an actual browser to download the page.
- Manually work out what the JavaScript code does and do equivalent work in Python.
- Run a headless JavaScript interpreter against a DOM that you've built up.
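The first option, which the asker eventually went with, can be sketched like this. Selenium and a matching browser driver are extra dependencies not present in the original code, and `fetch_rendered_html` is a name invented here:

```python
def fetch_rendered_html(url):
    """Fetch a page's HTML after client-side JavaScript has run.

    Sketch only: requires the selenium package and a browser driver
    (e.g. geckodriver for Firefox) available on PATH.
    """
    from selenium import webdriver  # imported here so the sketch stays optional

    driver = webdriver.Firefox()
    try:
        driver.get(url)
        # page_source reflects the DOM after scripts have executed,
        # so the dynamically built elements are present in it.
        return driver.page_source
    finally:
        driver.quit()
```

The returned string can then be fed to html.fromstring() and queried with the same XPath as before.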