CookieContainer handling of paths (Who ate my cookie?)

Problem description

I'm working on a project that involves some basic web crawling. I've been using HttpWebRequest and HttpWebResponse quite successfully. For cookie handling I just have one CookieContainer that I assign to HttpWebRequest.CookieContainer each time. It automatically gets populated with the new cookies each time and requires no additional handling from me. This had all been working fine until a little while ago, when one of the web sites that used to work suddenly stopped working. I'm reasonably sure it's a problem with the cookies, but I didn't keep a record of the cookies from when it used to work, so I'm not 100% sure.

I've managed to simulate the issue as I see it with the following code:

CookieContainer cookieJar = new CookieContainer();

Uri uri1 = new Uri("http://www.somedomain.com/some/path/page1.html");
CookieCollection cookies1 = new CookieCollection();
cookies1.Add(new Cookie("NoPathCookie", "Page1Value"));
cookies1.Add(new Cookie("CookieWithPath", "Page1Value", "/some/path/"));

Uri uri2 = new Uri("http://www.somedomain.com/some/path/page2.html");
CookieCollection cookies2 = new CookieCollection();
cookies2.Add(new Cookie("NoPathCookie", "Page2Value"));
cookies2.Add(new Cookie("CookieWithPath", "Page2Value", "/some/path/"));

Uri uri3 = new Uri("http://www.somedomain.com/some/path/page3.html");

// Add the cookies from page1.html
cookieJar.Add(uri1, cookies1);

// Add the cookies from page2.html
cookieJar.Add(uri2, cookies2);

// We should now have 3 cookies
Console.WriteLine(string.Format("CookieJar contains {0} cookies", cookieJar.Count));

Console.WriteLine(string.Format("Cookies to send to page1.html: {0}", cookieJar.GetCookieHeader(uri1)));
Console.WriteLine(string.Format("Cookies to send to page2.html: {0}", cookieJar.GetCookieHeader(uri2)));
Console.WriteLine(string.Format("Cookies to send to page3.html: {0}", cookieJar.GetCookieHeader(uri3)));

This simulates visiting two pages, both of which set two cookies. It then checks which of those cookies would be sent to each of three pages.

Of the two cookies, one is set without specifying a path and the other has a path specified. When a path is not specified, I had assumed that the cookie would be sent back to any page in that domain, but it seems to only get sent back to that specific page. I'm now assuming that is correct as it is consistent.

The main problem for me is the handling of cookies with a path specified. Surely, if a path is specified then the cookie should be sent to any page contained within that path. So, in the code above, 'CookieWithPath' should be valid for any page within /some/path/, which includes page1.html, page2.html and page3.html. Certainly if you comment out the two 'NoPathCookie' instances, then the 'CookieWithPath' gets sent to all three pages as I would expect. However, with the inclusion of 'NoPathCookie' as above, then 'CookieWithPath' only gets sent to page2.html and page3.html, but not page1.html.

Why is this, and is it correct?

Searching for this issue I have come across discussion about a problem with domain handling in CookieContainer, but have not been able to find any discussion about path handling.

I'm using Visual Studio 2005 / .NET 2.0

Solution

"When a path is not specified, I had assumed that the cookie would be sent back to any page in that domain, but it seems to only get sent back to that specific page. I'm now assuming that is correct as it is consistent."

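Yep, that's correct. Whenever the domain or path is not specified, it's taken from the current URI.

OK, let's take a look at CookieContainer. The method in question is InternalGetCookies(Uri) (http://typedescriptor.net/browse/members/324930-System.Net.CookieContainer.InternalGetCookies%28Uri%29). Here's the interesting part: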

while (enumerator2.MoveNext())
{
    DictionaryEntry dictionaryEntry = (DictionaryEntry)enumerator2.get_Current();
    string text2 = (string)dictionaryEntry.get_Key();
    if (!uri.AbsolutePath.StartsWith(CookieParser.CheckQuoted(text2)))
    {
        if (flag2)
        {
            break;
        }
        else
        {
            continue;
        }
    }
    flag2 = true;
    CookieCollection cookieCollection2 = (CookieCollection)dictionaryEntry.get_Value();
    cookieCollection2.TimeStamp(CookieCollection.Stamp.Set);
    this.MergeUpdateCollections(cookieCollection, cookieCollection2, port, flag, i < 0);
    if (!(text2 == "/"))
    {
        continue;
    }
    flag3 = true;
    continue;
}

enumerator2 here iterates over a (sorted) list of the cookies' paths. The list is sorted so that more specific paths (like /directory/subdirectory/) come before less specific ones (like /directory/), and otherwise in lexicographical order (/directory/page1 comes before /directory/page2).

What the code actually does is this: it iterates over the list of cookies' paths until it finds the first path that is a prefix of the requested URI's path. It then adds the cookies stored under that path to the output and sets flag2 to true, which means "OK, I finally found the place in the list that actually relates to the requested URI". After that, the first path encountered that is NOT a prefix of the requested URI's path is treated as the end of the related paths, so the code stops searching for cookies by executing break.

Obviously, this is some kind of optimization to avoid scanning the whole list, and it apparently works as long as none of the paths points to a concrete page. Now, in your case, the path list looks like this:

/some/path/page1.html
/some/path/page2.html
/some/path/

You can check this in a debugger by looking up ((System.Net.PathList)(cookieJar.m_domainTable["www.somedomain.com"])).m_list in the watch window.

So, for the page1.html URI, the code breaks on the page2.html item and never gets a chance to process the /some/path/ item.

In conclusion: this is obviously yet another bug in CookieContainer. I believe it should be reported on Microsoft Connect.
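The early break can be reproduced without touching the framework internals. The following is a minimal sketch, assuming the path list is ordered exactly as shown in the answer; `ScanPaths` and `PathScanDemo` are hypothetical names, and the helper mimics only the prefix test and the break, not the real InternalGetCookies. It uses `string.Join` over a list, which needs a newer framework than .NET 2.0.

```csharp
using System;
using System.Collections.Generic;

public static class PathScanDemo
{
    // Hypothetical helper: mimics only the prefix test and the early break
    // of the decompiled loop. It is not the framework's implementation.
    public static List<string> ScanPaths(IList<string> sortedPaths, string requestPath)
    {
        var matched = new List<string>();
        bool foundFirstMatch = false; // plays the role of flag2

        foreach (string path in sortedPaths)
        {
            if (!requestPath.StartsWith(path, StringComparison.Ordinal))
            {
                if (foundFirstMatch)
                {
                    break;    // first non-prefix after a match ends the scan
                }
                continue;     // still looking for the first matching path
            }
            foundFirstMatch = true;
            matched.Add(path);
        }
        return matched;
    }

    public static void Main()
    {
        // Path list in the order described in the answer (an assumption here).
        var paths = new List<string>
        {
            "/some/path/page1.html",
            "/some/path/page2.html",
            "/some/path/"
        };

        foreach (string page in new[] { "page1.html", "page2.html", "page3.html" })
        {
            string request = "/some/path/" + page;
            Console.WriteLine("{0}: {1}", page, string.Join(", ", ScanPaths(paths, request)));
        }
    }
}
```

In this sketch only the page1.html request stops before reaching /some/path/, which matches the observation that 'CookieWithPath' is sent to page2.html and page3.html but not to page1.html.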
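For contrast, here is a sketch of what the lookup would return for page1.html if the scan tested every stored path instead of stopping at the first non-matching one. `MatchAllPrefixes` and `FullScanDemo` are hypothetical names; this only illustrates the expected result and is not a patch for CookieContainer.

```csharp
using System;
using System.Collections.Generic;

public static class FullScanDemo
{
    // Hypothetical helper: collects every stored path that is a prefix of
    // the request path, with no early exit.
    public static List<string> MatchAllPrefixes(IEnumerable<string> storedPaths, string requestPath)
    {
        var matched = new List<string>();
        foreach (string path in storedPaths)
        {
            if (requestPath.StartsWith(path, StringComparison.Ordinal))
            {
                matched.Add(path);
            }
        }
        return matched;
    }

    public static void Main()
    {
        // Same path list as in the answer (assumed order).
        var paths = new List<string>
        {
            "/some/path/page1.html",
            "/some/path/page2.html",
            "/some/path/"
        };

        // Both the page-specific path and the directory path match.
        Console.WriteLine(string.Join(", ", MatchAllPrefixes(paths, "/some/path/page1.html")));
    }
}
```

With a full scan, the /some/path/ entry is reached for page1.html as well, so a cookie stored under that path would be included.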

PS: That's too many bugs for one class. I only hope the guy at MS who wrote the tests for this class has already been fired.
