从网页抓取表格 [英] Crawling tables from webpage

查看:130
本文介绍了从网页抓取表格的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在尝试从此网页提取csu员工薪水数据(

I'm trying to extract csu employee salary data from this webpage (http://www.sacbee.com/statepay/#req=employee%2Fsearch%2Fname%3D%2Fyear%3D2013%2Fdepartment%3DCSU%20Sacramento). I've tried using urlib2 and requests library, but none of them returned the actual table from the webpage. I guessed the reason could be that the table was generated dynamically by javascript. Below is my code using requests.

from lxml import html
import requests

page = requests.get("http://www.sacbee.com/statepay/#req=employee%2Fsearch%2Fname%3D%2Fyear%3D2013%2Fdepartment%3DCSU%20Sacramento")
tree = html.fromstring(page.text)
name = tree.xpath('//table/tbody/tr/td[2]/text()'

任何帮助/评论将不胜感激.

Any help/comments will be highly appreciated.

推荐答案

根据我的评论,这是我的尝试.请注意,我只提取了一行数据.所有其他一切都取决于您.

Here's my attempt on it, as per my comment. Note that I only pulled out one line of data. All else is up to you.

代码:

import requests as rq

url = "http://api.sacbeelabs.com/v1/statepay/employee/search/name=/year=2013/department=CSU%20Sacramento.json"
data = "74XoegZ494trsvrus_As4B4handjZ494-Adl4B4olg494dnnk933pppAmWYXaaAYjh3mnWnakWq3-Ela-B-Oahkgjqaa07tw8tJmaWlYd07tw8tJiWha07tw8uH07tw8tJqaWl07tw8uHtrsu07tw8tJZakWlnhain07tw8uHGT-107tw8trTWYlWhainj4B4labalal494dnnk933mnWYfj-8albgjpAYjh3-Boamnejim3tt_v_rt_3YlWpgeic1nWXgam1bljh1paXkWca4B4nenga494TnWnaDVjlfalDTWgWlqDTaWlYdD1DUdaDTWYlWhainjDFaaBDTWYlWhainjBDGWgebjlieW4B4mYlV49sxzrB4mYlL49srwrB4peiV49sxzrB4peiL49_stB4oW4974Wcain494Oj-CeggW3wArD-I-6ss-MD-1Xoino-MDNeio-AD-Azx2xv-MDl-89tzAr-JDKaYfj3trsrrsrsDJelabj-A3tzAr4B4njoYd49bWgmaB4Zjh4954mnjlWca4B4WiehWneji4B4YWi-8WmtZ4B4paXmjYfan4B4pjlfal4B4WoZej4B4-8eZaj4B4m-8c4B4cajgjY46B4Ymm4954WiehWneji4B4nlWimbjlh468B4omal4974Woi494Koamn488"
headers = {
'Host': 'api.sacbeelabs.com',
'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:28.0) Gecko/20100101 Firefox/28.0',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language': 'en-US,en;q=0.5',
'Accept-Encoding': 'gzip, deflate',
'X-SBAPI-Auth-Token': '0QNWbefXw6fQQcWXqK8vDw',
'X-SBAPI-SID': '3gbRqglHXAVDy1vwdcVVMf',
'X-SBAPI-CID': '2HuWho39ZcDUlTswYSWUd9',
'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
'Referer': 'http://www.sacbee.com/statepay/',
'Content-Length': '684',
'Origin': 'http://www.sacbee.com',
'Cookie': 'sbapi-cid=2HuWho39ZcDUlTswYSWUd9; sbapi-sid=3gbRqglHXAVDy1vwdcVVMf',
'Connection': 'keep-alive',
'Pragma': 'no-cache',
'Cache-Control': 'no-cache'
}

r = rq.post(url, data=data, headers=headers)
json_data = r.json()

base = json_data["result"]["employees"][0] # First employee.

name = base["name"]
first_name = name["first"]
last_name = name["last"]

pay = base["pay"]["total"]

title = base["title"]
dept = base["department"]

print first_name, last_name, pay, title, dept
# Your turn here...

结果:

Clayton Abajian 9844 Lecturer - Academic Year CSU Sacramento
[Finished in 0.9s]

这篇关于从网页抓取表格的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆