Is Python requests doing something wrong here, or is my POST request lacking something?

Problem description

I'm currently writing a program which will help users to determine optimal times to make a post on tumblr. As with Twitter, most followers have so many subscriptions that there is no way they can keep up, meaning it's best to know when one's own specific following is (mostly) online. On tumblr this can be determined in two ways -- first whether they have recently shared any content which was recently posted, and secondly whether they have recently added to their liked-posts list.

Frustratingly, even when set to 'public', the liked-posts stream of an arbitrary user (other than self) is only available to logged-in entities. As far as I know, that means I've either got to upload a login-cookie to the application every so often, or get this post-request working.

I've looked at a number of successful outbound requests via Opera's inspector but I must still be missing something, or perhaps requests is doing something that the server is rejecting no matter what I do.

The essence of the problem is below. This is currently written in Python 2.7 and uses Python requests and BeautifulSoup. To run it yourself, update the e and p pair at the top of get_login_response() to a real set of values.

import requests
from bs4 import BeautifulSoup

class Login:

    def __init__(self):
        self.session = requests.session()

    def get_hidden_fields(self):
        """ -> string. tumblr dynamically generates a key for its login forms
        This should extract that key from the form so that the POST-data to
        login will be accepted.
        """
        pageRequest = requests.Request("GET","https://www.tumblr.com/login")
        received = self.session.send( pageRequest.prepare() )
        html = BeautifulSoup(received.content)
        hiddenFieldDict = {}
        hiddenFields = html.find_all("input",type="hidden")
        for x in hiddenFields: hiddenFieldDict[x["name"]]=x["value"]
        return hiddenFieldDict

    def get_login_response(self):
        e = u"dead@live.com"
        p = u"password"
        endpoint = u"https://tumblr.com/login"
        payload = { u"user[email]": e,
                    u"user[password]": p,
                    u"user[age]":u"",
                    u"tumblelog[name]": u"",
                    u"host": u"www.tumblr.com",
                    u"Connection:":u"keep-alive",
                    u"Context":u"login",
                    u"recaptcha_response_field":u""
                  }
        payload.update( self.get_hidden_fields() )
    ##        headers = {"Content-Type":"multipart/form-data"}
        headers = {u"Content-Type":u"application/x-www-form-urlencoded",
                   u"Connection:":u"keep-alive",
                   u"Origin":u"https://tumblr.com",
                   u"Referer": u"https://www.tumblr.com/login",
                   u"User-Agent":u"Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.63 Safari/537.36 OPR/18.0.1284.68",
                   u"Accept":u"text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
                   u"Accept-Encoding":u"gzip,deflate,sdch",
                   u"Accept-Language":u"en-US,en;q=0.8",
                   u"Cache-Control":u"max-age=0"
                   #"Content-Length":VALUE is still needed
                   }
        # This cookie is stale, but it seems we get these for free anyway,
        #  so I'm not sure whether it's actually needed. It's mostly
        #  Google Analytics info.
        sendCookie = {"tmgioct":"52c720e28536530580783210",
                      "__qca":"P0-1402443420-1388781796773",
                      "pfs":"POIPdNt2p1qmlMGRbZH5JXo5k",
                      "last_toast":"1388783309",
                      "capture":"GDTLiEN5hEbMxPzys1ye1Gf4MVM",
                      "logged_in":"0",
                      "_ga":"GA1.2.2064992906.1388781797",
                      "devicePixelRatio":"1",
                      "documentWidth":"1280",
                      "anon_id":"VNHOJWQXGTQXHNCFKYJQUMUIVQBRISPR",
                      "__utma":"189990958.2064992906.1388781797.1388781797.1388781797.1",
                      "__utmb":"189990958.28.10.1388781797",
                      "__utmc":"189990958",
                      "__utmz":"189990958.1388781797.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none)"}
        loginRequest = requests.Request("POST",
                                        endpoint,
                                        headers,
                                        data=payload,
                                        cookies=sendCookie # needed?
##                                        ,auth=(e,p) # may not be needed
                                        )

        contentLength = len(loginRequest.prepare().body)
        loginRequest.data.update({u"Content-Length":unicode(contentLength)})
        return self.session.send( loginRequest.prepare() )

l = Login()
res = l.get_login_response()
print "All cookies: ({})".format(len(l.session.cookies))
print l.session.cookies # has a single generic cookie from the initial GET query
print "Celebrate if non-empty:"
print res.cookies # this should theoretically contain the login cookie

My resulting output:

All cookies: (1)
<<class 'requests.cookies.RequestsCookieJar'>[<Cookie tmgioct=52c773ed65cfa30622446430 for www.tumblr.com/>]>
Celebrate if non-empty:
<<class 'requests.cookies.RequestsCookieJar'>[]>

Bonus points if my code is insecure and you have pointers for me on that as well. I chose the requests module for its simplicity, but if it lacks features and my goal is achievable using httplib2 or something else, I am willing to switch.

Recommended answer

There are a number of things you're not doing that you need to be, and quite a few things you are doing that you don't need to.

Firstly, go back and examine the POST fields being sent on your login request. When I do this in Chrome, I see the following:

user[email]:<redacted>
user[password]:<redacted>
tumblelog[name]:
user[age]:
recaptcha_public_key:6Lf4osISAAAAAJHn-CxSkM9YFNbirusAOEmxqMlZ
recaptcha_response_field:
context:other
version:STANDARD
follow:
http_referer:http://www.tumblr.com/logout
form_key:!1231388831237|jS7l2SHeUMogRjxRiCbaJNVduXU
seen_suggestion:0
used_suggestion:0

Your Requests-based POST is missing a few of these fields, specifically recaptcha_public_key, version, follow, http_referer, form_key, seen_suggestion and used_suggestion.

These fields are not optional: they will need to be sent on this POST. Some of these can safely be used generically, but the safest way to get these is to get the data for the login page itself, and use BeautifulSoup to pull the values out of the HTML. I'm going to assume you've got the skillset to do that (e.g. you know how to find form inputs in HTML and parse them to get their default values).
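
For illustration, a rough sketch of that scraping step (the helper name get_login_form_fields is my own, not tumblr's or Requests'); it grabs every named <input> on the login page rather than just the hidden ones, so dynamically generated values like form_key come along for free:

import requests
from bs4 import BeautifulSoup

def get_login_form_fields(session):
    """Return every named <input> default from the login page as a dict."""
    response = session.get("https://www.tumblr.com/login")
    html = BeautifulSoup(response.content)
    fields = {}
    for tag in html.find_all("input"):
        name = tag.get("name")
        if name:                               # skip inputs without a name attribute
            fields[name] = tag.get("value", u"")
    return fields

session = requests.session()
payload = get_login_form_fields(session)
payload.update({u"user[email]": u"dead@live.com",
                u"user[password]": u"password"})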

A good habit to get into here is to start using a tool like Wireshark or tcpdump to examine the HTTP traffic your Requests code generates, and compare it to what you get from Chrome/Opera. This will allow you to see what is and isn't being sent, and how the two requests differ.
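
If a packet capture is awkward (this is HTTPS traffic, after all), another option is to have Requests itself log what it sends by turning on httplib's debug output. A small sketch, assuming Python 2.7 and a Requests build with vendored urllib3:

import logging
import httplib  # http.client on Python 3

# Print the raw request line, headers and body that httplib sends.
httplib.HTTPConnection.debuglevel = 1

# Also surface urllib3's connection/redirect logging from inside Requests.
logging.basicConfig()
logging.getLogger().setLevel(logging.DEBUG)
requests_log = logging.getLogger("requests.packages.urllib3")
requests_log.setLevel(logging.DEBUG)
requests_log.propagate = True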

Secondly, once you start hitting the login page you won't need to send cookies on your POST, so you can stop doing that. More generally, when using a requests Session object, you shouldn't input any additional cookies: just emulate the flow of HTTP requests from an actual browser and your cookies state will be fine.
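
To see that in action, note that the session's cookie jar already picks up tumblr's cookie from the initial GET (your own output shows the tmgioct cookie arriving that way), so there is nothing to pass manually. A tiny sketch:

session = requests.session()
session.get("https://www.tumblr.com/login")  # tumblr sets its tmgioct cookie here
print session.cookies                        # the jar already holds it; no cookies= argument needed
# Any later session.post()/session.get() sends those cookies automatically.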

Thirdly, you're massively over-specifying your headers dictionary. Most of the fields you're providing will be automatically populated by Requests. Now, given that you're trying to emulate a browser (Opera by the looks of things), you will want to override a few of them, but most can be left alone. You should be using this header dictionary:

{
    u"Origin":u"https://tumblr.com",
    u"Referer": u"https://www.tumblr.com/login",
    u"User-Agent":u"Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.63 Safari/537.36 OPR/18.0.1284.68",
    u"Accept":u"text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    u"Accept-Language":u"en-US,en;q=0.8",
}

Below is a list of the fields I removed from your header dictionary and why I removed them:



  • Content-Type: When you provide a dictionary to the data argument in Requests, we set the Content-Type to application/x-www-form-urlencoded for you. There's no need to do it yourself.
  • Connection: Requests manages HTTP connection pooling and keep-alives itself: don't get involved in the process, it'll just go wrong.
  • Accept-Encoding: Again, please let Requests set this unless you're actually prepared to deal with decoding the content. Requests only knows how to do gzip and deflate: if you send sdch and actually get it back, you'll have to decode it yourself. Best not to advertise you support it.
  • Cache-Control: POST requests cannot be cached, so this is irrelevant.

Fourth, and I want to be very clear here, do not calculate Content-Length yourself. Requests will do it for you and will get it right. If you send that header yourself, all kinds of weird bugs can come up that the Requests core dev team have to chase. There is never a good reason to set that header yourself. With this in mind, you can stop using PreparedRequest objects and just go back to using session.post().
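
Putting all of that together, a hedged sketch of the simplified flow (it reuses the get_login_form_fields helper sketched above, the trimmed header dictionary, and the field names from the Chrome capture; the exact endpoint behaviour is tumblr's to change):

def login(session, email, password):
    """Scrape the login form, then POST it back with session.post().
    Requests fills in Content-Type and Content-Length on its own."""
    headers = {
        u"Origin": u"https://tumblr.com",
        u"Referer": u"https://www.tumblr.com/login",
        u"User-Agent": u"Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 "
                       u"(KHTML, like Gecko) Chrome/31.0.1650.63 Safari/537.36 OPR/18.0.1284.68",
        u"Accept": u"text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        u"Accept-Language": u"en-US,en;q=0.8",
    }
    payload = get_login_form_fields(session)      # hidden inputs, form_key, etc.
    payload.update({u"user[email]": email,
                    u"user[password]": password})
    return session.post(u"https://www.tumblr.com/login",
                        data=payload,
                        headers=headers)

session = requests.session()
response = login(session, u"dead@live.com", u"password")
print response.status_code
print session.cookies   # on success the jar should now contain the logged-in cookies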
