How did a harmless crawler bypass WebForms authentication, and hijack a user's session?

Question

Last night a customer called, frantic, because Google had cached versions of private employee information. That information is not available unless you log in.

They had done a Google search for their domain, e.g.:

site:example.com

and noticed that Google had crawled, and cached, some internal pages.

Looking at the cached versions of the pages myself:

This is Google's cache of https://example.com/(F(NSvQJ0SS3gYRJB4UUcDa1z7JWp7Qy7Kb76XGu8riAA1idys-nfR1mid8Qw7sZH0DYcL64GGiB6FK_TLBy3yr0KnARauyjjDL3Wdf1QcS-ivVwWrq-htW_qIeViQlz6CHtm0faD8qVOmAzdArbgngDfMMSg_N4u45UysZxTnL3d6mCX7pe2Ezj0F21g4w9VP57ZlXQ_6Rf-HhK8kMBxEdtlrEm2gBwBhOCcf_f71GdkI1))/ViewTransaction.aspx?transactionNumber=12345. It is a snapshot of the page as it appeared on 15 Sep 2013 00:07:22 GMT

I was confused by the long URL. Rather than:

https://example.com/ViewTransaction.aspx?transactionNumber=12345

there was a long string inserted:

https://example.com/[...snip...]/ViewTransaction.aspx?transactionNumber=12345

It took me a few minutes to remember: that might be a symptom of ASP.net's "cookie-less sessions". If your browser does not support Set-Cookie, the web-site will embed a cookie in the URL.

Except our site doesn't use that.

And even if our site did have cookie-less sessions auto-detected, and Google managed to cajole the web-server into handing it a session in the url, how did it take over another user's session?

The site has been crawled by bots for years. And this past May 29 was no different.

Google usually starts its crawl by checking the robots.txt file (we don't have one). But nobody is allowed to read anything on the site (including robots.txt) without first being authenticated, so it fails:

Time      Uri                      Port  User Name         Status
========  =======================  ====  ================  ======
1:33:04   GET /robots.txt          80                      302    ;not authenticated, see /Account/Login.aspx
1:33:04   GET /Account/Login.aspx  80                      302    ;use https please
1:33:04   GET /Account/Login.aspx  443                     200    ;go ahead, try to login

All that time Google was looking for a robots.txt file. It never got one. Then it returns to try to crawl the root:

Time      Uri                      Port  User Name         Status
========  =======================  ====  ================  ======
1:33:04   GET /                    80                      302    ;not authenticated, see /Account/Login.aspx
1:33:04   GET /Account/Login.aspx  80                      302    ;use https please
1:33:04   GET /Account/Login.aspx  443                     200    ;go ahead, try to login

And another check of robots.txt on the secure site:

Time      Uri                      Port  User Name         Status
========  =======================  ====  ================  ======
1:33:04   GET /robots.txt          443                     302    ;not authenticated, see /Account/Login.aspx
1:33:04   GET /Account/Login.aspx  443                     200    ;go ahead, try to login

And then the stylesheet on the login page:

Time      Uri                      Port  User Name         Status
========  =======================  ====  ================  ======
1:33:04   GET /Styles/Site.css     443                     200    

And that's how every crawl from GoogleBot, msnbot, and BingBot works. Robots, login, secure, login. Never getting anywhere, because it cannot get past WebForms Authentication. And all is well with the world.

Until one day, GoogleBot shows up, with a Session cookie in hand!

Time      Uri                        Port  User Name            Status
========  =========================  ====  ===================  ======
1:49:21   GET /                      443   jatwood@example.com  200    ;they showed up logged in!
1:57:35   GET /ControlPanel.aspx     443   jatwood@example.com  200    ;now they're crawling that user's stuff!
1:57:35   GET /Defautl.aspx          443   jatwood@example.com  200    ;back to the homepage
2:07:21   GET /ViewTransaction.aspx  443   jatwood@example.com  200    ;and here comes the private information

The user, jatwood@example.com, had not been logged in for over a day. (I was hoping that IIS had given the same session identifier to two simultaneous visitors, separated by an application recycle.) Our site (web.config) is not configured to enable cookieless sessions, and neither is the server (machine.config).

So:

  • how did Google get ahold of a cookieless session?
  • how did Google get ahold of a valid cookieless session?
  • how did Google get ahold of a valid cookieless session that belonged to another user?

As recently as October 1 (4 days ago), the GoogleBot was still showing up, cookie in hand, logging in as this user, crawling, caching, and publishing, some of their private details.

How is Google, a non-malicious web-crawler, bypassing WebForms authentication?

IIS7, Windows Server 2008 R2, single server.

The server is not configured to give out cookieless sessions. But ignoring that fact, how can Google bypass authentication?

  • GoogleBot is visiting the web-site and attempting random usernames and passwords (not likely; the logs show no attempts to log in)
  • GoogleBot decided to insert a random cookieless session into the URL string, and it happened to match the session of an existing user (not likely)
  • The user managed to figure out how to make an IIS web-site return a cookieless URL (not likely), then pasted that URL onto another web-site (not likely), where Google found the cookieless URL and crawled it
  • The user is running through a mobile proxy (which they're not). The proxy server doesn't support cookies, so IIS creates a cookieless session. That caching server (e.g. Opera Mobile) was breached (not likely) and all cached links were posted on a hacker forum. GoogleBot crawled the hacker forum and started following all the links, including our jatwood@example.com cookieless session URL.
  • The user has a virus, which manages to cajole any IIS web-server into handing back a cookieless URL. That virus then reports back to headquarters. The URLs are posted onto a publicly accessible resource that GoogleBot crawls. GoogleBot then shows up at our server with the cookieless URL.

None of these is really plausible.

How can Google, a non-malicious web-crawler, bypass WebForms authentication and hijack a user's existing session?

I don't even know how an ASP.net web-site that is not configured to give out cookieless sessions could give out a cookieless session. Is it possible to back-convert a cookie-based session ID into a cookieless session ID? I could quote the relevant <sessionState> section of web.config and machine.config, and show that neither contains:

<sessionState cookieless="true">

How does the web-server decide that the browser doesn't support cookies? I tried blocking cookies in Chrome, and I was never given a cookieless session identifier. Can I simulate a browser that doesn't support cookies, in order to verify that my server is not giving out cookieless sessions?

Does the server decide cookieless sessions by User-Agent string? If so, I could set Internet Explorer with a spoofed UA.

Does session identity in ASP.net depend solely on the cookie? Can anyone, from any IP, with the cookie-url, access that session? Does ASP.net not, by default, also take the client's IP address into account?

If ASP.net does tie IP address with the session, wouldn't that mean that the session couldn't have originated from the employee at their home computer? Because then when the GoogleBot crawler tried to use it from a Google IP, it would have failed?

Have there been any instances anywhere (besides the one I linked) of ASP.net giving out cookieless sessions when it's not configured to? Is there a Microsoft Connect issue on this?

Is Web-Forms authentication known to have issues, such that it should not be used for security?

Edit: Removed the name of the bot that bypassed privilege, as people were confusing Google, the name of the crawler, with something else. I use Google, the name of the crawler, as a reminder that it was a non-malicious web-crawler that managed to crawl its way into another user's WebForms session. This is to contrast it with a malicious crawler that was trying to break into another user's session. Nothing like a pedant to bring out the aggravation.

Recommended answer

Though the question mainly references session identifiers, the length of the identifier struck me as unusual.

There are at least two types of cookie/cookieless operations that can modify the URL to include an ID:

  • Cookieless session
  • Cookieless forms authentication token

They are completely independent of each other (as far as I can tell).
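
The two kinds of token can usually be told apart by their prefix in the URL. Here is a rough sketch in Python (used only as a convenient external tool; it is not part of the ASP.NET application). It assumes the layout ASP.NET uses for cookieless values: a parenthesized path segment in which `S(...)` carries a session ID, `F(...)` a forms authentication ticket, and `A(...)` an anonymous ID. The helper name `cookieless_tokens` and the shortened token value are made up for illustration.

```python
import re
from urllib.parse import urlsplit

# Assumed layout: /(<key>(<value>)<key>(<value>)...)/page.aspx
_SEGMENT = re.compile(r"^\((?:[A-Z]\([^()]*\))+\)$")
_PAIR = re.compile(r"([A-Z])\(([^()]*)\)")

# Single-letter prefixes and what they are assumed to mean.
_KINDS = {"S": "session", "F": "forms-auth", "A": "anonymous-id"}


def cookieless_tokens(url):
    """Return {kind: value} for cookieless tokens found in the URL path."""
    found = {}
    for segment in urlsplit(url).path.split("/"):
        if _SEGMENT.match(segment):
            inner = segment[1:-1]  # drop the outer parentheses
            for key, value in _PAIR.findall(inner):
                found[_KINDS.get(key, key)] = value
    return found


# Hypothetical, shortened token; the real one in the question is much longer.
print(cookieless_tokens(
    "https://example.com/(F(abc123))/ViewTransaction.aspx?transactionNumber=12345"
))
```

Run against the cached URL from the question, a check like this would report a forms-auth token rather than a session token, which matters for deciding which setting to lock down.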

A cookieless session allows the server to access session state data based on a unique ID in the URL versus a unique ID in a cookie. This is usually considered a fine practice, though ASP.Net reuses session IDs which makes it more prone to session fixation attempts (separate topic but worth knowing about).

"Does session identity in ASP.net depend solely on the cookie? Can anyone, from any IP, with the cookie-url, access that session? Does ASP.net not, by default, also take the client's IP address into account?"

Only the session ID is required.

General session security reading

Based on the length of the example data, I'm guessing your URL actually contains a forms authentication value, not a session ID. The source code suggests that cookieless mode is not something you must explicitly enable.

/// <summary>ASP.NET determines whether to use cookies based on
/// <see cref="T:System.Web.HttpBrowserCapabilities" /> setting. 
/// If the setting indicates that the browser or device supports cookies, 
/// cookies are used; otherwise, an identifier is used in the query string.</summary>
UseDeviceProfile

Here's how the determination is made:

// System.Web.Security.CookielessHelperClass
internal static bool UseCookieless( HttpContext context, bool doRedirect, HttpCookieMode cookieMode )
{
    switch( cookieMode )
    {
        case HttpCookieMode.UseUri:
            return true;
        case HttpCookieMode.UseCookies:
            return false;
        case HttpCookieMode.AutoDetect:
            {
                // omitted for length
                return false;
            }
        case HttpCookieMode.UseDeviceProfile:
            if( context == null )
            {
                context = HttpContext.Current;
            }
            return context != null && ( !context.Request.Browser.Cookies || !context.Request.Browser.SupportsRedirectWithCookie );
        default:
            return false;
    }
}

Guess what the default is? HttpCookieMode.UseDeviceProfile. ASP.Net maintains a list of devices and capabilities. This list is generally a very bad thing; for example, IE11 gives a false positive for being a downlevel browser on par with Netscape 4.
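
One way to check whether a given User-Agent pushes your server into cookieless mode is to request a page without sending any cookies, refuse to follow redirects, and inspect the `Location` header. Below is a minimal Python sketch of that idea. The URL and User-Agent string are placeholders, and the function names are hypothetical. Note the limitation: this only catches cases where the server rewrites the URL on an anonymous request. A forms authentication ticket would presumably only appear in the URL after a successful login, which the sketch does not attempt.

```python
import re
import urllib.error
import urllib.request

# A parenthesized path segment such as /(S(...))/ or /(F(...))/
COOKIELESS = re.compile(r"/\((?:[A-Z]\([^()]*\))+\)/")


def has_cookieless_segment(location):
    """True if a redirect target carries an ASP.NET-style cookieless segment."""
    return bool(COOKIELESS.search(location or ""))


class _NoRedirect(urllib.request.HTTPRedirectHandler):
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None  # do not follow; we want to see the Location header


def probe(url, user_agent):
    """Request url with no cookies and return (status, Location header)."""
    opener = urllib.request.build_opener(_NoRedirect)
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with opener.open(request, timeout=10) as response:
            return response.status, response.headers.get("Location")
    except urllib.error.HTTPError as err:  # unfollowed 3xx responses land here
        return err.code, err.headers.get("Location")


if __name__ == "__main__":
    # Both values are placeholders; substitute your own site and the
    # User-Agent strings you want to test.
    status, location = probe("https://example.com/", "ExampleAgent/1.0")
    print(status, location, has_cookieless_segment(location))
```

Replaying User-Agent strings taken from your own IIS logs is likely to be more productive than guessing which ones the device profile treats as cookie-incapable.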

I think Gene's explanation is very likely; Google found the URL from some user action and crawled it.

It's completely conceivable that the Google bot is deemed to not support cookies. But this doesn't explain the origin of the URL, i.e. what user action resulted in Google seeing a URL with an ID already in it? A simple explanation could be a user with a browser that was deemed to not support cookies. Depending on the browser, everything else could look fine to the user.

The timing, i.e. the duration of validity, seems long, though I'm not that familiar with how long authentication tickets are valid and under what circumstances they can be renewed. It's entirely possible ASP.Net continued to reissue/renew tickets, as it would for a continually active user.

I'm making a lot of assumptions here, but if I'm correct:

  • First, reproduce the behavior in your environment.
  • Explicitly disable cookieless behavior by using HttpCookieMode.UseCookies.

web.config:

 <authentication mode="Forms">
    <forms loginUrl="~/Account/Login.aspx" name=".ASPXFORMSAUTH" timeout="26297438"
           cookieless="UseCookies" />
 </authentication>

While this should resolve the behavior, you might investigate extending the forms authentication HTTP module and adding additional validation (or at least logging/diagnostics).
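
On the question of how long a ticket stays valid: the `timeout` attribute of `<forms>` is expressed in minutes, so the value in the web.config fragment above can be converted directly. Here is a quick back-of-the-envelope check, with Python used only as a calculator and a 365.25-day year as an approximation:

```python
from datetime import timedelta

FORMS_TIMEOUT_MINUTES = 26297438  # value of timeout="..." in the fragment above

lifetime = timedelta(minutes=FORMS_TIMEOUT_MINUTES)
years = lifetime.days / 365.25

print(f"{lifetime.days} days, roughly {years:.1f} years")
# prints: 18262 days, roughly 50.0 years
```

If that value matches the real configuration, a ticket that leaked into a URL would remain valid far longer than the months over which the crawler kept returning, even without any renewal.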
