Scraping a JavaScript-generated page using Python


Problem Description



I need to scrape some information from https://hasjob.co/. I can scrape some of it by getting through the login page and scraping as usual, but most of the information is generated by JavaScript, and only when you scroll down to the bottom of the page.

Is there any solution using Python?

import mechanize
import cookielib
from bs4 import BeautifulSoup
import html2text

import pprint

job = []

# Browser
br = mechanize.Browser()

# Cookie Jar
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)

# Browser options
br.set_handle_equiv(True)
br.set_handle_redirect(True)
br.set_handle_referer(True)
br.set_handle_robots(False)
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)

br.addheaders = [('User-agent', 'Chrome')]

# The site we will navigate into, handling its session
br.open('https://auth.hasgeek.com/login')

# View available forms
##for f in br.forms():
##    print f

# Select the second (index one) form (the first form is a search query box)
br.select_form(nr=1)

# User credentials
br.form['username'] = 'username'
br.form['password'] = 'pass'

br.submit()

##print(br.open('https://hasjob.co/').read())

r = br.open('https://hasjob.co/')


soup = BeautifulSoup(r, 'html.parser')  # name a parser explicitly so bs4 doesn't guess


for tag in soup.find_all('span', attrs={'class': 'annotation bottom-right'}):

    p = tag.text
    job.append(p)


pp = pprint.PrettyPrinter(depth=6)

pp.pprint(job)
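
For the scroll-triggered content itself, one common workaround is to drive a real browser so the JavaScript actually runs. Below is a minimal sketch of that approach using Selenium; Selenium is not part of the original code above, and the sketch assumes Selenium 4 and a working ChromeDriver are installed:

from selenium import webdriver
import time

driver = webdriver.Chrome()
driver.get('https://hasjob.co/')

# Keep scrolling to the bottom until the page height stops growing,
# i.e. JavaScript has no more content left to load.
last_height = driver.execute_script('return document.body.scrollHeight')
while True:
    driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
    time.sleep(2)  # give the scroll-triggered requests time to render
    new_height = driver.execute_script('return document.body.scrollHeight')
    if new_height == last_height:
        break
    last_height = new_height

# Hand the fully rendered HTML to BeautifulSoup as before.
html = driver.page_source
driver.quit()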

Solution

For some reason almost no one notices that Hasjob has an Atom feed and it's linked from the home page. Reading structured data from Hasjob using the feedparser library is as simple as:

import feedparser
feed = feedparser.parse('https://hasjob.co/feed')
for job in feed.entries:
    print job.title, job.link, job.published, job.content

The feed used to cover a full 30 days, but that's now over 800 entries and a fair bit of load on the server, so I've cut it down to the last 24 hours of jobs. If you want a regular helping of jobs, just load from this URL at least once a day.
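
If you do want that daily helping, any scheduler will do. Below is a minimal sketch of polling once every 24 hours from a long-running process; this is my construction rather than anything from the answer, and a cron job invoking the snippet above once a day would work just as well:

import time
import feedparser

# Re-fetch the feed once every 24 hours; the feed only covers the
# last day of jobs, so daily polling is enough to see everything.
while True:
    feed = feedparser.parse('https://hasjob.co/feed')
    for job in feed.entries:
        print('%s %s' % (job.title, job.link))
    time.sleep(24 * 60 * 60)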
