Can't get JavaScript-generated HTML using Python

Problem description

I'm trying to write a Python script that automatically extracts the content of a table on a web page. I got it working on pure HTML pages, but one website is giving me a headache: its HTML seems to be generated by JavaScript. I tried the dryscrape, selenium and qt4 libraries, following examples found in several posts, but still without success. All I ever get is the HTML as it was before the JavaScript ran, so without the tables. I can see the table in the browser and when I "Inspect" the HTML with Chrome. When I do "View Page Source" in Chrome, the table is not there either, which may be a hint.

The website is the following:

https://www.ictax.admin.ch/extern/en.html#/security/CH0008899764/20161231

Here is some code I tried (if you check, none of the responses contain any table tags):

Using urllib2:

import urllib2
url="https://www.ictax.admin.ch/extern/en.html#/security/CH0008899764/20161231"
# urlopen returns a file-like response object; read() returns the HTML body.
html = urllib2.urlopen(url).read()
print html
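
A note on why a plain HTTP fetch probably cannot contain the table: everything after `#` in this URL is a fragment. By URL semantics the fragment is handled on the client side and is not part of what the server is asked for, so the server most likely returns the same `en.html` shell for every security, and the page's JavaScript uses the fragment to decide what to load. The sketch below (Python 3, standard library only) just splits the URL to make that visible; the variable names are illustrative.

```python
from urllib.parse import urldefrag

url = "https://www.ictax.admin.ch/extern/en.html#/security/CH0008899764/20161231"

# urldefrag separates the part that identifies the resource on the server
# from the fragment, which is only interpreted in the browser.
result = urldefrag(url)
requested_resource = result.url
client_side_route = result.fragment

print("Requested from the server:", requested_resource)
print("Interpreted by JavaScript:", client_side_route)
```

This is consistent with "View Page Source" showing no table: the source is the shell page, and the table only exists in the DOM after the scripts have run.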

Using dryscrape:

import dryscrape 
session = dryscrape.Session()
session.visit(url) 
response = session.body()
print response

Using Selenium:

from selenium import webdriver
driver = webdriver.Chrome("/usr/lib/chromium/chromedriver")
driver.get(url)
print driver.page_source #page_source fetches page after rendering is complete
driver.quit()

Using PyQt4:

import sys  
from PyQt4.QtGui import *  
from PyQt4.QtCore import *  
from PyQt4.QtWebKit import *  
from lxml import html 

class Render(QWebPage):  
  def __init__(self, url):  
    self.app = QApplication(sys.argv)  
    QWebPage.__init__(self)  
    self.loadFinished.connect(self._loadFinished)  
    self.mainFrame().load(QUrl(url))  
    self.app.exec_()  

  def _loadFinished(self, result):  
    self.frame = self.mainFrame()  
    self.app.quit() 

#This does the magic.Loads everything
r = Render(url)  
#result is a QString.
result = r.frame.toHtml()
#QString should be converted to string before processed by lxml
formatted_result = str(result.toAscii())
print formatted_result

I would really appreciate it if somebody could help me with this :-)

Cheers

Recommended answer

Use an implicit wait (an explicit wait would also work) so that Selenium keeps looking for the element until the JavaScript has rendered it, instead of searching the page immediately after it loads:

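from selenium import webdriver

# PhantomJS and find_element_by_tag_name belong to the Selenium API of the time;
# later Selenium 4 releases removed both.
driver = webdriver.PhantomJS()
url = "https://www.ictax.admin.ch/extern/en.html#/security/CH0008899764/20161231"
# With an implicit wait, each find_element* call polls for up to 30 seconds.
driver.implicitly_wait(30)
driver.get(url)
print(driver.find_element_by_tag_name("table").text)
driver.quit()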

This is the output I got:

Titel/Titres/Titoli W Nominell Valoren-Nr. Steuerwert Ertrag / Rendement / Reddito 2016 M Valeur No de Val. imposable Datum / Date Cp. W Brutto KG/KEP zu versteuernder V nominale valeur Val. imposible Data M Brut Ertrag/Rendement Valore Numero di 31.12.2016 ex. zahlb. V lordo imposable/Reddito nominale valore pay. imponible CHF (E) pag. Fr.W. CHF CHF iShares ETF (CH) - iShares SMI (R) (CH), Schweiz
CHF 0.00 889 976 85.31 25.02. 29.02. 36 CHF 0.48
03.03. 07.03. 37 CHF 0.48
11.04. 13.04. 38 CHF 0.70
19.07. 21.07. 40 CHF 0.88
19.07. 21.07. 39 CHF 0.20
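
The answer leaves open whether an explicit wait would do as well. As a rough illustration of what an explicit wait amounts to, the sketch below polls for the first `<table>` until it contains text or a timeout expires. It is written in Python 3 and hand-rolls the loop so the logic is visible; Selenium's own `WebDriverWait` provides the same idea ready-made. The function name `wait_for_table_text` and its parameters are hypothetical, and the `driver` argument is assumed to expose Selenium 4's `find_elements(by, value)`, where `"tag name"` is the string value of `By.TAG_NAME`.

```python
import time


def wait_for_table_text(driver, timeout=30.0, poll_interval=0.5):
    """Poll until the first <table> has non-empty text, then return that text.

    Assumption: `driver` offers find_elements(by, value) as in Selenium 4,
    and each returned element has a `.text` attribute.
    """
    deadline = time.monotonic() + timeout
    while True:
        # find_elements returns an empty list (no exception) when nothing matches.
        tables = driver.find_elements("tag name", "table")
        if tables:
            text = tables[0].text.strip()
            if text:
                return text
        if time.monotonic() >= deadline:
            raise TimeoutError("no populated <table> within %.1f s" % timeout)
        time.sleep(poll_interval)
```

With a real browser this might be called as `wait_for_table_text(driver)` right after `driver.get(url)`. Unlike the implicit wait, it also waits for the table to be filled in, not just to exist. A page that re-renders while being polled can raise Selenium's `StaleElementReferenceException`, which this sketch does not handle.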
