Login to website for scraping with Python

Question

I need to get links to genetic pathways from a website. First I need to log in, but I am having trouble. I have very little experience with scraping, so any pointers or general how-to information would be much appreciated, along with an exact answer.

import requests
from bs4 import BeautifulSoup

URL = 'http://www.broadinstitute.org/gsea/msigdb/genesets.jsp?collection=CP:BIOCARTA'
session1 = requests.Session()
params = {'login': 'my_email'}
session2 = session1.post(URL, data=params)  # post() returns a Response object
soup = BeautifulSoup(session2.content, 'html.parser')  # parse the response body

pathways_links = []

# Walk down the nested divs to the gene set table and collect every link in it
for link in soup.find('div', attrs={'id': 'wrapper'}).find(
    'div', attrs={'id': 'contentwrapper'}).find(
        'div', attrs={'id': 'content_navs'}).find(
            'table', attrs={'id': 'geneSetTable'}).find_all('a', href=True):
    pathways_links.append(link['href'])
    print(link['href'])

Unfortunately, it doesn't seem to log me in. I get:

'div', attrs={'id':'content_navs'}).find(
 AttributeError: 'NoneType' object has no attribute 'find'

If I ask it to print the links before the 'content_navs' div, I get:

<div id="content_full">
<h1>Login to GSEA/MSigDB</h1>
<h2>Login</h2>
<a href="register.jsp"></a>Click here</div>

Any solutions would be much appreciated. Thanks.

Recommended answer

You need to log in at http://www.broadinstitute.org/gsea/login.jsp first, and then go to the other location.

The first step is to create a session object, which persists cookies and other session details across requests. Next, log in, then fetch the target page with the same session, and finally pass its content to BeautifulSoup:

import requests
from bs4 import BeautifulSoup

s = requests.Session()  # keeps cookies between requests
data = {'j_username': 'you@email.com'}
s.post('http://www.broadinstitute.org/gsea/login.jsp', data=data)  # log in first
r = s.get('http://www.broadinstitute.org/gsea/msigdb/genesets.jsp?collection=CP:BIOCARTA')
soup = BeautifulSoup(r.content, 'html.parser')

# the rest of your code
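
As a rough sketch of what "the rest of your code" might look like, the helper below pulls the links out of the 'geneSetTable' table and fails with a readable message when the table is missing, which is what happens when the login page comes back instead. The function name extract_pathway_links and its base_url parameter are made up for illustration and are not part of the original answer. It also assumes that looking the table up by its id is enough, rather than walking through each parent div as the question does.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_pathway_links(html, base_url):
    """Return absolute URLs for every link inside the geneSetTable table.

    Hypothetical helper: the name and signature are illustrative only.
    """
    soup = BeautifulSoup(html, "html.parser")

    # Look the table up by id directly instead of chaining .find() calls,
    # so a missing intermediate <div> cannot raise AttributeError.
    table = soup.find("table", attrs={"id": "geneSetTable"})
    if table is None:
        # Report the page heading to make a failed login easy to spot.
        heading = soup.find("h1")
        title = heading.get_text(strip=True) if heading else "unknown page"
        raise RuntimeError("geneSetTable not found; got page: " + title)

    # find_all() returns every <a>; find() would return only the first one.
    return [urljoin(base_url, a["href"]) for a in table.find_all("a", href=True)]
```

With the session from the answer, this would be called as extract_pathway_links(r.content, r.url). If the call raises with a heading such as "Login to GSEA/MSigDB", the login step did not succeed and the form field names are worth checking against the login page's HTML.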
