Log in to a webpage with Python to scrape data


Question

I am trying to build a web scraper to extract my stats data from MWO Mercs. To do so, it is necessary to log in to the page and then go through the 6 different stats pages to get the data (this will go into a database later, but that is not my question).

The login form is given below (from https://mwomercs.com/login?return=/profile/stats?type=mech). From what I can see, two fields need data, EMAIL and PASSWORD, and they need to be posted. It should then open http://mwomercs.com/profile/stats?type=mech. After that I need to keep a session to cycle through the various stats pages.

I have tried using urllib, mechanize and requests, but I have been totally unable to find the right answer. I would prefer to use requests.

I do realise that similar questions have been asked on Stack Overflow, but I have searched for a very long time with no success.

Thanks for any help you can provide.

<div id="stubPage">
    <div class="container">
        <h1 id="stubPageTitle">LOGIN</h1>
        <div id="loginForm">
            <form action="/do/login" method="post">

                <legend>MechWarrior Online <a href="/signup" class="btn btn-warning pull-right">REGISTER</a></legend>


                <label>Email Address:</label>
                <div class="input-prepend"><span class="add-on textColorBlack textPlain">@</span><input id="email" name="email" class="span4" size="16" type="text" placeholder="user@example.org"></div>

                <label>Password:</label>

                <div class="input-prepend"><span class="add-on"><span class="icon-lock"></span></span><input id="password" name="password" class="span4" size="16" type="password"></div>

                <br>
                <button type="submit" class="btn btn-large btn-block btn-primary">LOGIN</button>

                <br>
                <span class="pull-right">[ <a href="#" id="forgotLink">Forgot Your Password?</a> ]</span>

                <br>
                <input type="hidden" name="return" value="/profile/stats?type=mech">
            </form>
        </div>
    </div>
</div>


Recommended answer

The Requests documentation is simple and easy to follow when it comes to submitting form data. Please give this a read-through: More Complicated POST requests

Logins usually come down to saving the cookie and sending it with future requests.

After you POST to the login page with requests.post(), use the returned response object to retrieve the cookies. This is one way to do it:

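import requests

login_url = 'https://mwomercs.com/do/login'  # action of the form in the question
stats_url = 'http://mwomercs.com/profile/stats?type=mech'
post_headers = {'content-type': 'application/x-www-form-urlencoded'}
# The form names its fields "email" and "password"; set both variables beforehand
payload = {'email': email, 'password': password}
login_request = requests.post(login_url, data=payload, headers=post_headers)
cookie_dict = login_request.cookies.get_dict()  # cookies set by the login response
stats_request = requests.get(stats_url, cookies=cookie_dict)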
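Since the question asks for a session that can cycle through several stats pages, here is a minimal sketch of the same idea using requests.Session, which stores cookies from each response and sends them on later requests. The helper names new_session and login_and_fetch are hypothetical, and the sketch assumes the site needs nothing beyond the posted form fields (no CSRF token, no JavaScript); that is not confirmed for this site.

```python
def new_session():
    """Return a requests.Session.

    The import is local so that login_and_fetch below can also be
    exercised with a stand-in session object, without network access.
    """
    import requests
    return requests.Session()


def login_and_fetch(session, login_url, credentials, page_urls):
    """POST the login form once, then GET each page with the same session.

    `session` is expected to behave like requests.Session: it keeps the
    cookies set by the login response and sends them on later requests.
    Returns a dict mapping each URL to the text of its page.
    """
    login_response = session.post(login_url, data=credentials)
    login_response.raise_for_status()  # stop early on a 4xx/5xx status

    pages = {}
    for url in page_urls:
        response = session.get(url)
        response.raise_for_status()
        pages[url] = response.text
    return pages
```

A call might look like login_and_fetch(new_session(), login_url, {'email': email, 'password': password}, [stats_url]), with the list extended to the other stats page URLs once you know them.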

If you still have problems, check the return code from the request with login_request.status_code, or look for an error in the page content with login_request.text.
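As a rough illustration of that check, the sketch below combines the status code with a look at the page text. It assumes that a failed login re-renders the login form, so the marker id="loginForm" from the HTML in the question would still be present; that behaviour is a guess about the site, and the function name login_looks_successful is hypothetical.

```python
def login_looks_successful(response, form_marker='id="loginForm"'):
    """Heuristic check on the final response of a login POST.

    Returns True when the status code is 2xx and the page no longer
    contains the login form marker. The marker is an assumption taken
    from the form HTML shown in the question.
    """
    if not 200 <= response.status_code < 300:
        return False
    return form_marker not in response.text
```

This is meant for the response you get after redirects have been followed; with redirects disabled, a 302 from the login endpoint would be reported as unsuccessful.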

Edit:

Some sites will redirect you several times when you make a request. Make sure to check the history attribute of the returned object (for example login_request.history) to see what happened and why you got bounced out. For example, I get redirects like this all the time:

>>> some_request.history
(<Response [302]>, <Response [302]>)

Each item in history is the response to one of the intermediate requests. You can inspect them like normal response objects, for example login_request.history[0].url, and you can disable redirects by passing allow_redirects=False in your request parameters:

login_request = requests.post(login_url, data=payload, headers=post_headers, allow_redirects=False)
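To make the redirect chain easier to read, a small helper along these lines can list every hop in order. The name redirect_chain is hypothetical; it only relies on the history, status_code and url attributes that Requests response objects provide.

```python
def redirect_chain(response):
    """Return [(status_code, url), ...] for every hop, ending with the final response."""
    hops = list(response.history) + [response]
    return [(hop.status_code, hop.url) for hop in hops]
```

Printing redirect_chain(login_request) after the login POST shows where each redirect sent you, which helps spot a bounce back to the login page.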

In some cases, I've had to disallow redirects and add new cookies before progressing to the proper page. Try using something like this to keep your existing cookies and add the new cookies to them:

# dict.items() cannot be concatenated with + on Python 3, so merge with update()
cookie_dict.update(new_request.cookies.get_dict())

Doing this after each request will keep your cookies up to date for your next request, similar to how your browser would.
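Putting the last two ideas together, the sketch below follows redirects one hop at a time and merges the cookies from every response along the way. It is only an illustration under assumptions: follow_manually and REDIRECT_CODES are hypothetical names, and get stands for any callable that accepts the same arguments as requests.get.

```python
from urllib.parse import urljoin

# Status codes treated as redirects in this sketch.
REDIRECT_CODES = {301, 302, 303, 307, 308}


def follow_manually(first_response, get, max_hops=10):
    """Follow redirects hop by hop, carrying cookies from every response.

    `first_response` is a response obtained with allow_redirects=False.
    `get` is a callable with the signature of requests.get.
    Returns (last_response, cookies). If max_hops is exhausted, the last
    response may itself still be a redirect.
    """
    response = first_response
    cookies = dict(response.cookies.get_dict())
    for _ in range(max_hops):
        location = response.headers.get("Location")
        if response.status_code not in REDIRECT_CODES or not location:
            break
        next_url = urljoin(response.url, location)  # Location may be relative
        response = get(next_url, cookies=cookies, allow_redirects=False)
        cookies.update(response.cookies.get_dict())  # newer values win
    return response, cookies
```

With the names used earlier, this could be called as follow_manually(login_request, requests.get) after posting with allow_redirects=False.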
