Login to website for scraping with Python
Problem description
I need to get links to genetic pathways from a website. First I need to log in, but I am having trouble. I have very little experience with scraping, so any pointers or general "how to" information would be very much appreciated, along with an exact answer.
import requests
from bs4 import BeautifulSoup
URL = 'http://www.broadinstitute.org/gsea/msigdb/genesets.jsp?collection=CP:BIOCARTA'
session1 = requests.Session()
params = {'login':'my_email'}
session2 = session1.post(URL, data=params)
pathways_links = []
for link in soup.find('div', attrs={'id':'wrapper'}).find(
        'div', attrs={'id':'contentwrapper'}).find(
        'div', attrs={'id':'content_navs'}).find(
        'table', attrs={'id':'geneSetTable'}).find('a')['href']:
    pathways_links.append(link)
    print link
Unfortunately, it doesn't seem to log me in. I get:
'div', attrs={'id':'content_navs'}).find(
AttributeError: 'NoneType' object has no attribute 'find'
If I ask it to print the links before the 'content_navs' div, I get:
<div id="content_full">
<h1>Login to GSEA/MSigDB</h1>
<h2>Login</h2>
<a href="register.jsp"></a>Click here</div>
Any solutions would be much appreciated. Thanks.
Recommended answer
You need to log in first at http://www.broadinstitute.org/gsea/login.jsp and then go to the other location.
The first step is to create a session object, which will persist cookies and other session details. Next, you need to log in, and then finally pass the contents to BeautifulSoup:
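import requests
from bs4 import BeautifulSoup

# A Session keeps the cookies set at login and sends them on later requests
s = requests.Session()
data = {'j_username': 'you@email.com'}
s.post('http://www.broadinstitute.org/gsea/login.jsp', data=data)
r = s.get('http://www.broadinstitute.org/gsea/msigdb/genesets.jsp?collection=CP:BIOCARTA')
# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(r.content, 'html.parser')
# the rest of your code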
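For "the rest of your code", here is a minimal sketch of the link extraction, not a definitive implementation. Two things in the question's loop are worth noting: find() returns only the first match (or None when nothing matches, which is where the AttributeError comes from), and iterating over find('a')['href'] walks the characters of a single string rather than a list of links. The helper name extract_pathway_links and the sample markup below are made up for illustration; only the id geneSetTable comes from the question, and the real page may be laid out differently.

```python
from bs4 import BeautifulSoup


def extract_pathway_links(html):
    """Return the href of every link inside the table with id 'geneSetTable'.

    Returns an empty list when the table is missing, for example when the
    server sent back the login page instead of the gene set listing.
    """
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", attrs={"id": "geneSetTable"})
    if table is None:
        return []
    # find_all returns every matching tag; href=True skips anchors without an href
    return [a["href"] for a in table.find_all("a", href=True)]


# Hypothetical markup, invented for this example only
sample_html = """
<div id="wrapper"><div id="contentwrapper"><div id="content_navs">
<table id="geneSetTable">
<tr><td><a href="cards/EXAMPLE_PATHWAY_A">EXAMPLE_PATHWAY_A</a></td></tr>
<tr><td><a href="cards/EXAMPLE_PATHWAY_B">EXAMPLE_PATHWAY_B</a></td></tr>
</table>
</div></div></div>
"""

# Shaped like the login page fragment shown in the question
login_html = '<div id="content_full"><h1>Login to GSEA/MSigDB</h1></div>'

print(extract_pathway_links(sample_html))
print(extract_pathway_links(login_html))
```

With the session from the answer, this would presumably be called as extract_pathway_links(r.text). An empty result is then a hint that the login did not succeed and the login page came back again.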