Programmatically download text that doesn't appear in the page source


Question

I'm writing a crawler in Python. Given a single web page, I extract its HTML content in the following manner:

from urllib.request import urlopen  # urllib2 in Python 2

response = urlopen('http://www.example.com/')
html = response.read()  # raw bytes of the page source

But some text components don't appear in the HTML page source. For example, on this page (it redirects to the index; open one of the dates and view a specific mail), if you view the page source you will see that the mail text isn't there and seems to be loaded by JavaScript.

How can I programmatically download this text?

Recommended answer

The easiest option here would be to make a POST request to the URL responsible for the email search and parse the JSON results (mentioning @recursive, since he suggested the idea first). Example using the requests package:

import requests

data = {
    'year': '1999',
    'month': '05',
    'day': '20',
    'locale': 'en-us'
}
response = requests.post('http://jebbushemails.com/api/email.py', data=data)

results = response.json()
for email in results['emails']:
    print(email['dateCentral'], email['subject'])

Prints:

1999-05-20T00:48:23-05:00 Re: FW: The Reason Study of Rail Transportation in Hillsborough
1999-05-20T04:07:26-05:00 Escambia County School Board
1999-05-20T06:29:23-05:00 RE: Escambia County School Board
...
1999-05-20T22:56:16-05:00 RE: School Board
1999-05-20T22:56:19-05:00 RE: Emergency Supplemental just passed 64-36
1999-05-20T22:59:32-05:00 RE:
1999-05-20T22:59:33-05:00 RE: (no subject)
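
The loop above assumes that the request succeeds and that every entry carries both keys. A slightly more defensive sketch follows; it separates the network call from the parsing so that the parsing can be checked offline. The helper names `fetch_emails` and `summarize_emails` and the timeout value are illustrative assumptions, not part of the original answer; the endpoint and form fields are the ones used above, and the service may no longer respond.

```python
API_URL = 'http://jebbushemails.com/api/email.py'  # endpoint taken from the answer above


def fetch_emails(year, month, day, locale='en-us', timeout=10):
    """POST the search form and return the decoded JSON payload."""
    # Imported here so that summarize_emails() has no third-party dependency.
    import requests

    data = {'year': year, 'month': month, 'day': day, 'locale': locale}
    response = requests.post(API_URL, data=data, timeout=timeout)
    response.raise_for_status()  # raise on HTTP 4xx/5xx instead of parsing an error page
    return response.json()


def summarize_emails(payload):
    """Return (date, subject) pairs, using '' for any missing key."""
    return [
        (email.get('dateCentral', ''), email.get('subject', ''))
        for email in payload.get('emails', [])
    ]
```

Calling `summarize_emails(fetch_emails('1999', '05', '20'))` would then yield the same pairs that the loop above prints, provided the endpoint still answers in the same JSON shape.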


A different approach here would be to let a real browser handle the dynamic JavaScript part of the page load with the help of the selenium browser automation framework:

# -*- coding: utf-8 -*-
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


driver = webdriver.Chrome()  # can also be, for example, webdriver.Firefox()
driver.get('http://jebbushemails.com/email/search')

# click 1999-2000
button = driver.find_element(By.XPATH, '//button[contains(., "1999 – 2000")]')
button.click()

# click 20
cell = driver.find_element(By.XPATH, '//table[@role="grid"]//span[. = "20"]')
cell.click()

# click Submit
submit = driver.find_element(By.XPATH, '//button[span[1]/text() = "Submit"]')
submit.click()

# wait for result to appear
WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, "//tr[@analytics-event]")))

# get the results
for row in driver.find_elements(By.XPATH, '//tr[@analytics-event]'):
    date, subject = row.find_elements(By.TAG_NAME, 'td')
    print(date.text, subject.text)

Prints:

6:24:27am Fw: Support Coordination
6:26:18am Last nights meeting
6:52:16am RE: Support Coordination
7:09:54am St. Pete Times article
8:05:35am semis on the interstate
...
6:07:25pm Re: Appointment
6:18:07pm Re: Mayor Hood
8:13:05pm Re: Support Coordination

Note that the browser here can also be headless, like PhantomJS. And if there is no display for the browser to run in, you can start a virtual one; see examples here:
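
PhantomJS is no longer maintained, so a headless Chrome session is one alternative for the headless case. A minimal sketch, assuming Selenium 3.8 or later with a matching chromedriver installed; the helper names and the window size are illustrative and do not come from the original answer:

```python
def headless_chrome_arguments(width=1280, height=1024):
    """Command-line switches that ask Chrome to run without a visible window."""
    return ['--headless', '--window-size={},{}'.format(width, height)]


def make_headless_driver():
    """Start Chrome with the switches above and return the driver."""
    # Imported here so the argument helper can be used without selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    for argument in headless_chrome_arguments():
        options.add_argument(argument)
    return webdriver.Chrome(options=options)
```

Replacing `webdriver.Chrome()` in the script above with `make_headless_driver()` should leave the rest of the steps unchanged, since the page is still rendered by a real browser engine.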
