Python Web-Scraping CSRF Token Issue

Question

I am using MechanicalSoup to log in to a website via Python 3.6, and I'm having issues with the CSRF token.

Every time I request the HTML back, the response reads "Invalid CSRF token: Forbidden". Searching the HTML of the login page, the closest match for an element id that looks like the token is "authenticity_token", which appears to be already filled in with the token.

I was able to use the "re" module to extract the token and resubmit it to the element with the id mentioned above, but no luck. Note that I had to find the element by id, since it has no name attribute (this is why my Robobrowser approach didn't work).

This is the element that I think corresponds to the CSRF:

<input id="authenticity_token" type="hidden" value="b+csp/9zR/a1yfuPPIYJSiR0v8jJUTaJaGqJmJPmLmivSn4GtLgvek0nyPvcJ0aOgeo0coHpl94MuH/r1OK5UA==">

In this case, I would extract "b+csp/9zR/a1yfuPPIYJSiR0v8jJUTaJaGqJmJPmLmivSn4GtLgvek0nyPvcJ0aOgeo0coHpl94MuH/r1OK5UA==" and resubmit it to that element.

Here is my code, with dummy values for the username, password, and URL:

import mechanicalsoup
import re

def return_token(str1):
    match1 = "authenticity_token"
    match2 = r'.*value="(.*)".*'
    for x in range(len(str1)):
        line = str1[x]
        if re.findall(match1,line):
            token = re.findall(match2,line)[0]
            return token

url1 = ""
username = ""
password = ""

browser = mechanicalsoup.Browser()
page = browser.get(url1)
str0 = page.text
token = return_token(str0.split('\n'))
#print(str0)
form = page.soup.find("form",{"id":"loginForm"})

form.find('input', {'name': 'username'})['value'] = username
form.find('input', {'name': 'password'})['value'] = password
form.find('input', {'id': 'authenticity_token'})['value'] = str(token)

response = browser.submit(form, page.url)
print(response.text)

Recommended answer

I believe the issue here is that <input> elements must have a name attribute to be submitted via POST or GET. Since your token is in a name-less <input> element, MechanicalSoup does not submit it, which is the same thing a browser would do.

From the W3C specification:

Every successful control has its control name paired with its current value as part of the submitted form data set. A successful control must be defined within a FORM element and must have a control name.

...

A control's "control name" is given by its name attribute.
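To make the rule concrete, here is a rough sketch that uses the standard-library html.parser to split the inputs of a form into those that would be submitted and those that would not. The sample form and the class name SuccessfulControls are made up for illustration, and this is not how MechanicalSoup is implemented internally; it only mirrors the "must have a control name" rule quoted above.

```python
from html.parser import HTMLParser


class SuccessfulControls(HTMLParser):
    """Separate <input> elements that have a name from those that do not."""

    def __init__(self):
        super().__init__()
        self.submitted = []  # (name, value) pairs for inputs with a name
        self.skipped = []    # ids of inputs without a name

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        if name:
            self.submitted.append((name, attributes.get("value", "")))
        else:
            self.skipped.append(attributes.get("id"))


# Hypothetical login form modelled on the one described in the question
sample_html = """
<form id="loginForm">
  <input name="username" type="text" value="alice">
  <input name="password" type="password" value="secret">
  <input id="authenticity_token" type="hidden" value="abc123==">
</form>
"""

parser = SuccessfulControls()
parser.feed(sample_html)
print("submitted:", parser.submitted)
print("skipped:", parser.skipped)
```

In this sample, the token input ends up in the skipped list because it only has an id, which matches the behaviour described in the question.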

Perhaps there is some JavaScript that is handling the CSRF token.
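If a script on the page is what forwards the token, one possible workaround is to read the value by id and add it to the request data yourself. The sketch below is only a guess at what that could look like: the helper build_login_payload is hypothetical, and the field name authenticity_token is an assumption, so check the request your browser actually sends (for example in its developer tools) to see the real parameter or header name.

```python
import re


def build_login_payload(html, username, password, token_field="authenticity_token"):
    """Build form data that includes the token read from the input with id authenticity_token.

    token_field is an assumed parameter name, not something the site confirms.
    """
    # Locate the <input> tag carrying the id, then pull out its value attribute
    tag = re.search(r'<input\b[^>]*\bid="authenticity_token"[^>]*>', html)
    if tag is None:
        raise ValueError("no input with id authenticity_token found")
    value = re.search(r'\bvalue="([^"]*)"', tag.group(0))
    if value is None:
        raise ValueError("token input has no value attribute")
    return {"username": username, "password": password, token_field: value.group(1)}


# Shortened stand-in for the element shown in the question
page_html = '<input id="authenticity_token" type="hidden" value="b+csp/9zR==">'
payload = build_login_payload(page_html, "user", "pass")
print(payload)
```

The resulting dictionary could then be posted through the browser's underlying requests session, for example browser.session.post(login_url, data=payload), where login_url stands in for the form's action URL. If the site expects the token in a request header instead, the same value would go into the headers argument.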

For a similar discussion, see "Does form data still transfer if the input tag has no name?"

Regarding your usage of MechanicalSoup, the classes StatefulBrowser and Form would simplify your script. For example, if you just had to open the page and input a username and password:

import mechanicalsoup

# These values are filled by the user
url = ""
username = ""
password = ""

# Open the page
browser = mechanicalsoup.StatefulBrowser(raise_on_404=True)
browser.open(url)

# Fill in the form values
form = browser.select_form('form[id=loginForm]')
form['username'] = username
form['password'] = password

# Submit the form and print the resulting page text
response = browser.submit_selected()
print(response.text)
