如何像在浏览器中一样检索确切的HTML [英] How to retrieve the exact HTML as in a browser
问题描述
我正在使用Python脚本渲染网页并检索其HTML.它在大多数页面上都可以正常工作,但是对于其中一些页面,检索到的HTML是不完整的.我不太明白为什么.这是我用来抓取此页面的脚本,由于某种原因,指向每个产品的链接都不在HTML中:
I'm using a Python script to render web pages and retrieve their HTML's. It works fine with most of the pages, but with some of them the HTML retrieved is incomplete. And I don't quite understand why. This is the script I'm using to scrap this page, for some reason, the link to every product is not in the HTML:
链接: http://www.pullandbear.com/es/es/mujer/vestidos-c29016.html
Python脚本:
import sys
from PyQt4.QtGui import *
from PyQt4.QtCore import *
from PyQt4.QtWebKit import *
from PyQt4 import QtNetwork
from PyQt4 import QtCore
url = sys.argv[1]
path = sys.argv[2]
class Render(QWebPage):
def __init__(self, url):
self.app = QApplication(sys.argv)
QWebPage.__init__(self)
self.loadFinished.connect(self._loadFinished)
self.request = QtNetwork.QNetworkRequest()
self.request.setUrl(QtCore.QUrl(url))
self.request.setRawHeader("Accept-Language", QtCore.QByteArray ("es ,*"))
self.mainFrame().load(self.request)
self.app.exec_()
def _loadFinished(self, result):
self.frame = self.mainFrame()
self.app.quit()
r = Render(url)
result = r.frame.toHtml()
html_file = open(path, "w")
html_file.write("%s" % result.encode("utf-8"))
html_file.close()
sys.exit(app.exec_())
我错过了什么吗?这个框架有什么局限性?
Am I missing something? What are the limitations of this framework?
预先感谢
推荐答案
如果您想无头浏览,可以结合使用 phantomjs 使用硒,以下代码将获取所有源代码:
If you want headless browsing you can combine phantomjs with selenium, the following gets all the source:
url = "http://www.pullandbear.com/es/es/mujer/vestidos-c29016.html"
from selenium import webdriver
dr = webdriver.PhantomJS()
dr.get(url)
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
element = WebDriverWait(dr, 5).until(
EC.presence_of_element_located((By.CLASS_NAME, "grid_itemContainer"))
)
仅使用不带WebDriverWait
的硒并不总是返回完整的源代码,而是添加等待直到带有 grid_itemContainer 类的a标签可见,以确保html已生成,下面的xpath返回所有链接:
Just using selenium without the WebDriverWait
did not always return the full source, adding the wait until the a tags with the grid_itemContainer class were visible makes sure the html has been generated, the xpath below returns all your links:
print([a.get_attribute('href') for a in dr.find_elements_by_xpath("//a[@class='grid_itemContainer']")])
[u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-detalle-crochet-pechera-c29016p100064004.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-bordado-escote-pico-c29016p100123006.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-manga-larga-espalda-abierta-c29016p100147503.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-hombros-descubiertos-beads-c29016p100182001.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-jacquard-capa-c29016p100255505.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-vaquero-eyelets-c29016p100336010.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-liso-oversized-c29016p100289013.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-liso-oversized-c29016p100289013.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-camisero-oversized-c29016p100036616.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-cuello-pico-c29016p100166506.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-estampado-rayas-c29016p100234507.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-manga-corta-liso-c29016p100262008.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-largo-cuello-halter-liso-c29016p100036162.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-capa-jacquard-%C3%A9tnico-c29016p100259002.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-largo-cuello-halter-rayas-c29016p100036161.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-capa-jacquard-tri%C3%A1ngulo-c29016p100255506.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-marinero-escote-bardot-c29016p100259003.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-rayas-escote-espalda-c29016p100262007.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-cruzado-c29016p100216013.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-flores-canes%C3%BA-bordado-c29016p100203011.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-bordados-c29016p100037160.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-flores-volante-c29016p100216014.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-lencero-c29016p100104515.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-cuadros-detalle-encaje-c29016p100216016.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-drapeado-abertura-bajo-c29016p100129011.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-drapeado-abertura-bajo-c29016p100129011.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-vaquero-bolsillo-plastr%C3%B3n-c29016p100036822.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-rayas-bajo-desigual-c29016p100123010.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-camisero-vaquero-c29016p100036575.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-midi-estampado-rayas-c29016p100189011.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-midi-rayas-manga-3-4-c29016p100149507.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-midi-canal%C3%A9-ajustado-c29016p100149508.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-estampado-bolsillos-c29016p100212503.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-corte-evas%C3%A9-bolsillos-c29016p100189012.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-vaquero-camisero-cuadros-c29016p100036624.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/pichi-vaquero-c29016p100073526.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-estampado-geom%C3%A9trico-cuello-halter-c29016p100037021.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-cuello-perkins-manga-larga-c29016p100036882.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-cuello-perkins-manga-larga-c29016p100036882.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-cuello-perkins-manga-larga-c29016p100036882.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-cuello-perkins-manga-larga-c29016p100036882.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-jacquard-evas%C3%A9-c29016p100037207.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-cr%C3%AApe-evas%C3%A9-estampado-flores-manga-3-4-c29016p100036932.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-cr%C3%AApe-evas%C3%A9-estampado-flores-manga-3-4-c29016p100037280.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-cuello-perkins-parche-c29016p100037464.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-cr%C3%AApe-evas%C3%A9-liso-manga-3-4-c29016p100036930.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-cr%C3%AApe-evas%C3%A9-liso-manga-3-4-c29016p100036930.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-cuello-alto-liso-c29016p100037156.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-cuello-alto-estampado-flores-c29016p100036921.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-cuello-alto-estampado-corbatero-c29016p100037155.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-largo-manga-sisa-c29016p100170011.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-largo-manga-sisa-rayas-c29016p100170012.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-manga-acampanada-c29016p100149506.html', u'http://www.pullandbear.com/es/es/mujer/vestidos/vestido-punto-espalda-abierta-c29016p100195504.html']
如果要编写源代码:
with open("out.html", "w") as f:
f.write(dr.page_source)
这篇关于如何像在浏览器中一样检索确切的HTML的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!