How to retrieve the values of dynamic HTML content using Python


Question

I'm using Python 3 and I'm trying to retrieve data from a website. However, this data is loaded dynamically, and the code I have right now doesn't work:

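from urllib import request

# eveCentralBaseURL and mineral are defined elsewhere in the program
url = eveCentralBaseURL + str(mineral)
print("URL : %s" % url)

response = request.urlopen(url)
data = str(response.read(10000))

# str() on bytes leaves each newline as the two characters "\n"; restore real newlines
data = data.replace("\\n", "\n")
print(data)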

Where I'm trying to find a particular value, I'm finding a template instead, e.g. "{{formatPrice median}}" instead of "4.48".

How can I retrieve the value instead of the placeholder text?

This is the specific page I'm trying to extract information from. I'm trying to get the "median" value, which uses the template {{formatPrice median}}.

Edit 2: I've installed Selenium and BeautifulSoup and set up my program to use them.

My code is now:

from bs4 import BeautifulSoup
from selenium import webdriver

#...

driver = webdriver.Firefox()
driver.get(url)

html = driver.page_source
soup = BeautifulSoup(html)

print("Finding...")

for tag in soup.find_all('formatPrice median'):
    print(tag.text)

Here is a screenshot of the program as it's executing. Unfortunately, it doesn't seem to be finding anything with "formatPrice median" specified.

Recommended answer

Assuming you are trying to get values from a page that is rendered using JavaScript templates (for instance, something like Handlebars), the placeholder text is what you will get with any of the standard solutions (i.e. BeautifulSoup or requests).

This is because the browser uses JavaScript to alter what it received and create new DOM elements. urllib does the requesting part like a browser, but not the template rendering part. A good description of the issues can be found here. This article discusses three main solutions:

  1. Parse the AJAX JSON directly
  2. Use an offline JavaScript interpreter to process the request: SpiderMonkey, crowbar
  3. Use a browser automation tool: splinter
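
As a rough illustration of option 1, the sketch below decodes a JSON response and reads one field from it. It assumes the page fills its template from a JSON endpoint that you have located yourself (for example in the browser's network tab). The helper names `fetch_json` and `extract_median` and the `{"median": ...}` payload layout are hypothetical and not taken from the site in question.

```python
import json
from urllib import request


def fetch_json(url):
    """Download a URL and decode its body as JSON (assumes UTF-8)."""
    with request.urlopen(url) as response:
        return json.loads(response.read().decode("utf-8"))


def extract_median(payload):
    """Return the median price from a decoded payload.

    The {"median": ...} layout is an assumption made for illustration;
    inspect the real response to find the actual field name.
    """
    return float(payload["median"])


# Inline sample payload, so this example runs without network access.
sample = json.loads('{"typeid": 34, "median": 4.48}')
median = extract_median(sample)
print(median)
```

When such an endpoint exists, this tends to be the lightest approach, since no browser has to be started and no HTML has to be parsed.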

This answer provides a few more suggestions for option 3, such as Selenium or Watir. I've used Selenium for automated web testing and it's pretty handy.

Edit

From your comments, it looks like it is a Handlebars-driven site. I'd recommend Selenium and BeautifulSoup. This answer gives a good code example which may be useful:

from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Firefox()
driver.get('http://eve-central.com/home/quicklook.html?typeid=34')

html = driver.page_source
soup = BeautifulSoup(html, "html.parser")

# check out the docs for the kinds of things you can do with 'find_all'
# this (untested) snippet should find tags with a specific CSS class
# see: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#searching-by-css-class
for tag in soup.find_all("a", class_="my_class"):
    print(tag.text)

Basically, Selenium gets the rendered HTML from your browser, and you can then parse that HTML, taken from the page_source property, with BeautifulSoup. Good luck :)
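
One caveat with reading page_source immediately is that the templates may not have been rendered yet. The sketch below is one possible way to wait for an element before parsing, not a tested solution for this site. The helper names and the CSS selector you pass in are hypothetical, and the presence of an element does not strictly guarantee that its text has been filled in.

```python
def fetch_rendered_html(url, css_selector, timeout=10):
    """Load url in Firefox and return the HTML once css_selector matches."""
    # Imported here so the helpers below work without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Firefox()
    try:
        driver.get(url)
        # Block until the element exists in the DOM, or raise on timeout.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, css_selector))
        )
        return driver.page_source
    finally:
        driver.quit()  # always close the browser, even on errors


def extract_texts(html, css_selector):
    """Return the stripped text of every element matching css_selector."""
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    return [tag.get_text(strip=True) for tag in soup.select(css_selector)]


def parse_price(text):
    """Convert a price string such as '1,234.50' to a float."""
    return float(text.replace(",", ""))
```

A call might look like `extract_texts(fetch_rendered_html(url, selector), selector)`, where `selector` identifies the element that holds the rendered value. Note that `soup.find_all('formatPrice median')` looks for a tag with that name, which is why it finds nothing: "formatPrice median" is a template expression, not an HTML tag, so the search has to target the element around the rendered value instead.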
