Scraping a JavaScript-generated page using Python


Problem Description



I need to scrape some information from https://hasjob.co/. I can scrape some of it by getting through the login page and scraping as usual, but most of the information is generated by JavaScript, and only when you scroll down to the bottom of the page.

Is there any solution using Python?

import mechanize
import cookielib
from bs4 import BeautifulSoup
import html2text

import pprint

job = []

# Browser
br = mechanize.Browser()

# Cookie Jar
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)

# Browser options
br.set_handle_equiv(True)
br.set_handle_redirect(True)
br.set_handle_referer(True)
br.set_handle_robots(False)
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)

br.addheaders = [('User-agent', 'Chrome')]

# The site we will navigate into, handling its session
br.open('https://auth.hasgeek.com/login')

# View available forms
##for f in br.forms():
##    print f

# Select the second (index one) form (the first form is a search query box)
br.select_form(nr=1)

# User credentials
br.form['username'] = 'username'
br.form['password'] = 'pass'

br.submit()

##print(br.open('https://hasjob.co/').read())

r = br.open('https://hasjob.co/')


soup = BeautifulSoup(r, 'html.parser')  # name a parser explicitly so bs4 doesn't guess


for tag in soup.find_all('span', attrs={'class': 'annotation bottom-right'}):

    p = tag.text
    job.append(p)


pp = pprint.PrettyPrinter(depth=6)

pp.pprint(job)
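
For the scroll-triggered content itself, one common workaround is to drive a real browser so the JavaScript actually runs. Below is a minimal sketch of that approach using Selenium; Selenium is not part of the original code above, and the sketch assumes Selenium 4 and a working ChromeDriver are installed:

from selenium import webdriver
import time

driver = webdriver.Chrome()
driver.get('https://hasjob.co/')

# Keep scrolling to the bottom until the page height stops growing,
# i.e. JavaScript has no more content left to load.
last_height = driver.execute_script('return document.body.scrollHeight')
while True:
    driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
    time.sleep(2)  # give the scroll-triggered requests time to render
    new_height = driver.execute_script('return document.body.scrollHeight')
    if new_height == last_height:
        break
    last_height = new_height

# Hand the fully rendered HTML to BeautifulSoup as before.
html = driver.page_source
driver.quit()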

Solution

For some reason almost no one notices that Hasjob has an Atom feed and it's linked from the home page. Reading structured data from Hasjob using the feedparser library is as simple as:

import feedparser
feed = feedparser.parse('https://hasjob.co/feed')
for job in feed.entries:
    print job.title, job.link, job.published, job.content

The feed used to cover a full 30 days, but that's now over 800 entries and a fair bit of load on the server, so I've cut it down to the last 24 hours of jobs. If you want a regular helping of jobs, just load from this URL at least once a day.
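
If you do want that daily helping, any scheduler will do. Below is a minimal sketch of polling once every 24 hours from a long-running process; this is my construction rather than anything from the answer, and a cron job invoking the snippet above once a day would work just as well:

import time
import feedparser

# Re-fetch the feed once every 24 hours; the feed only covers the
# last day of jobs, so daily polling is enough to see everything.
while True:
    feed = feedparser.parse('https://hasjob.co/feed')
    for job in feed.entries:
        print('%s %s' % (job.title, job.link))
    time.sleep(24 * 60 * 60)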
