Programmatically download text that doesn't appear in the page source
Question
I'm writing a crawler in Python. Given a single web page, I extract its HTML content in the following manner:
import urllib2
response = urllib2.urlopen('http://www.example.com/')
html = response.read()
But some text components don't appear in the HTML page source. For example, on this page (it redirects to the index; open one of the dates and view a specific mail), viewing the page source shows that the mail text is not in the source and seems to be loaded by JavaScript.
How can I programmatically download this text?
Recommended answer
The easiest option here is to make a POST request to the URL responsible for the email search and parse the JSON results (credit to @recursive, who suggested the idea first). Example using the requests package:
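import requests

# form fields the search page sends for a single day
data = {
    'year': '1999',
    'month': '05',
    'day': '20',
    'locale': 'en-us'
}
response = requests.post('http://jebbushemails.com/api/email.py', data=data)
results = response.json()  # parse the JSON body into a dict
for email in results['emails']:
    # formatted this way so the line runs on both Python 2 and Python 3
    print('%s %s' % (email['dateCentral'], email['subject']))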
Prints:
1999-05-20T00:48:23-05:00 Re: FW: The Reason Study of Rail Transportation in Hillsborough
1999-05-20T04:07:26-05:00 Escambia County School Board
1999-05-20T06:29:23-05:00 RE: Escambia County School Board
...
1999-05-20T22:56:16-05:00 RE: School Board
1999-05-20T22:56:19-05:00 RE: Emergency Supplemental just passed 64-36
1999-05-20T22:59:32-05:00 RE:
1999-05-20T22:59:33-05:00 RE: (no subject)
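To crawl more than one day, the request above could be wrapped in small helpers. This is only a sketch: the function names build_payload, extract_subjects and fetch_day are hypothetical, the zero-padded month and day format is inferred from the single example request, and the endpoint may no longer respond the way it did when the answer was written.

```python
import datetime


def build_payload(day, locale='en-us'):
    """Build the form fields for one day's search (zero-padded, as in the example)."""
    return {
        'year': '%04d' % day.year,
        'month': '%02d' % day.month,
        'day': '%02d' % day.day,
        'locale': locale,
    }


def extract_subjects(results):
    """Return (timestamp, subject) pairs from an already-parsed JSON response."""
    return [(email['dateCentral'], email['subject'])
            for email in results.get('emails', [])]


def fetch_day(day, url='http://jebbushemails.com/api/email.py'):
    """POST the search for one day and return its (timestamp, subject) pairs."""
    import requests  # imported here so the helpers above work without it
    response = requests.post(url, data=build_payload(day), timeout=30)
    response.raise_for_status()  # fail loudly on HTTP errors
    return extract_subjects(response.json())
```

Calling fetch_day(datetime.date(1999, 5, 20)) would send the same form fields as the example request; keeping the parsing in extract_subjects means it can be checked against a saved response without touching the network.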
A different approach is to let a real browser handle the dynamic JavaScript part of the page load, with the help of the selenium browser automation framework:
# -*- coding: utf-8 -*-
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()  # can also be, for example, webdriver.Firefox()
driver.get('http://jebbushemails.com/email/search')

# click 1999-2000
button = driver.find_element(By.XPATH, '//button[contains(., "1999 – 2000")]')
button.click()

# click 20
cell = driver.find_element(By.XPATH, '//table[@role="grid"]//span[. = "20"]')
cell.click()

# click Submit
submit = driver.find_element(By.XPATH, '//button[span[1]/text() = "Submit"]')
submit.click()

# wait for result to appear
WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, "//tr[@analytics-event]")))

# get the results: each row holds a time cell and a subject cell
for row in driver.find_elements(By.XPATH, '//tr[@analytics-event]'):
    date, subject = row.find_elements(By.TAG_NAME, 'td')
    print('%s %s' % (date.text, subject.text))
Prints:
6:24:27am Fw: Support Coordination
6:26:18am Last nights meeting
6:52:16am RE: Support Coordination
7:09:54am St. Pete Times article
8:05:35am semis on the interstate
...
6:07:25pm Re: Appointment
6:18:07pm Re: Mayor Hood
8:13:05pm Re: Support Coordination
Note that the browser here can also be headless, like PhantomJS. And if there is no display for the browser to work in, you can fire up a virtual one; see examples here:
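As a rough illustration of the headless option, here is a sketch that swaps in Chrome's own headless mode instead of PhantomJS. It is not the linked example, and the helper names headless_chrome_arguments and make_headless_driver are hypothetical. It assumes Chrome and a matching driver are installed, and that the installed selenium accepts the options keyword.

```python
def headless_chrome_arguments(width=1280, height=1024):
    """Command-line flags that ask Chrome to run without a visible window."""
    return ['--headless', '--window-size=%d,%d' % (width, height)]


def make_headless_driver():
    """Start Chrome in headless mode; needs selenium plus a Chrome driver."""
    from selenium import webdriver  # imported here so the flag helper has no dependency
    options = webdriver.ChromeOptions()
    for argument in headless_chrome_arguments():
        options.add_argument(argument)
    return webdriver.Chrome(options=options)
```

The driver returned by make_headless_driver() could stand in for webdriver.Chrome() in the script above; a fixed window size is set because a headless browser has no real screen from which to take one.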