Why does Python say this Netscape cookie file isn't valid?


Problem description

I'm writing a Google Scholar parser, and based on this answer, I'm setting cookies before grabbing the HTML. This is the contents of my cookies.txt file:

# Netscape HTTP Cookie File
# http://curlm.haxx.se/rfc/cookie_spec.html
# This file was generated by libcurl! Edit at your own risk.

.scholar.google.com     TRUE    /       FALSE   2147483647      GSP     ID=353e8f974d766dcd:CF=2
.google.com     TRUE    /       FALSE   1317124758      PREF    ID=353e8f974d766dcd:TM=1254052758:LM=1254052758:S=_biVh02e4scrJT1H
.scholar.google.co.uk   TRUE    /       FALSE   2147483647      GSP     ID=f3f18b3b5a7c2647:CF=2
.google.co.uk   TRUE    /       FALSE   1317125123      PREF    ID=f3f18b3b5a7c2647:TM=1254053123:LM=1254053123:S=UqjRcTObh7_sARkN

and this is the code I'm using to grab the HTML:

import http.cookiejar
import urllib.request, urllib.parse, urllib.error

def get_page(url, headers="", params=""):
    filename = "cookies.txt"
    request = urllib.request.Request(url, None, headers, params)
    cookies = http.cookiejar.MozillaCookieJar(filename, None, None)
    cookies.load()
    cookie_handler = urllib.request.HTTPCookieProcessor(cookies)
    redirect_handler = urllib.request.HTTPRedirectHandler()
    opener = urllib.request.build_opener(redirect_handler,cookie_handler)
    response = opener.open(request)
    return response

start = 0
search = "Ricardo Altamirano"
results_per_fetch = 20
host = "http://scholar.google.com"
base_url = "/scholar"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; U; ru; rv:5.0.1.6) Gecko/20110501 Firefox/5.0.1 Firefox/5.0.1'}
params = urllib.parse.urlencode({'start' : start,
                                 'q': '"' + search + '"',
                                 'btnG' : "",
                                 'hl' : 'en',
                                 'num': results_per_fetch,
                                 'as_sdt' : '1,14'})

url = base_url + "?" + params
resp = get_page(host + url, headers, params)

The full traceback is:

Traceback (most recent call last):
  File "C:/Users/ricardo/Desktop/Google-Scholar/BibTex/test.py", line 29, in <module>
    resp = get_page(host + url, headers, params)
  File "C:/Users/ricardo/Desktop/Google-Scholar/BibTex/test.py", line 8, in get_page
    cookies.load()
  File "C:\Python32\lib\http\cookiejar.py", line 1767, in load
    self._really_load(f, filename, ignore_discard, ignore_expires)
  File "C:\Python32\lib\http\cookiejar.py", line 1997, in _really_load
    filename)
http.cookiejar.LoadError: 'cookies.txt' does not look like a Netscape format cookies file

I've looked around for documentation on the Netscape cookie file format, but I can't find anything that shows me the problem. Are there newlines that need to be included? I changed the line endings to Unix style, just in case, but that didn't solve the problem. The closest specification I can find is this, which doesn't indicate anything to me that I'm missing. The fields on each of the last four lines are separated by tabs, not spaces, and everything else looks correct to me.

Solution

I see nothing in your example code or copy of the cookies.txt file that is obviously wrong.

I've checked the source code for the MozillaCookieJar._really_load method, which throws the exception that you see.

The first thing this method does is read the first line of the file you specified (using f.readline()) and use re.search to look for the regular expression pattern "#( Netscape)? HTTP Cookie File". This is the check that fails for your file.

It certainly looks like your cookies.txt would match that format, so the error you see is quite surprising.
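As a cross-check, the header and one cookie line from the question can be written to a temporary file and loaded directly. This is a minimal sketch, not part of the original answer: the helper name load_cookie_text is hypothetical, and the exact header check inside http.cookiejar may differ between Python versions.

```python
import http.cookiejar
import os
import tempfile

HEADER = "# Netscape HTTP Cookie File\n"
# One data line copied from the question; the seven fields are tab-separated.
COOKIE_LINE = "\t".join([
    ".scholar.google.com", "TRUE", "/", "FALSE",
    "2147483647", "GSP", "ID=353e8f974d766dcd:CF=2",
]) + "\n"


def load_cookie_text(text):
    """Write text to a temporary file and load it with MozillaCookieJar.

    Hypothetical helper for illustration only.
    """
    fd, path = tempfile.mkstemp(suffix=".txt")
    try:
        with os.fdopen(fd, "w", encoding="ascii", newline="\n") as f:
            f.write(text)
        jar = http.cookiejar.MozillaCookieJar(path)
        # ignore_expires=True keeps cookies whose timestamp has already passed.
        jar.load(ignore_expires=True)
        return jar
    finally:
        os.remove(path)


# A well-formed file: magic header, blank line, one cookie.
jar = load_cookie_text(HEADER + "\n" + COOKIE_LINE)
print([cookie.name for cookie in jar])

# The same data under a header with the wrong capitalisation.
try:
    load_cookie_text("# netscape http cookie file\n" + COOKIE_LINE)
except http.cookiejar.LoadError as exc:
    print("rejected:", exc)
```

If the first call succeeds on your machine while your real cookies.txt does not, the difference is most likely in the bytes of that file's first line rather than in the cookie lines themselves.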

Note that your file is opened with a simple open(filename) call earlier on, so it'll be opened in text mode with universal line ending support, meaning it doesn't matter that you are running this on Windows. The code will see \n newline terminated strings, regardless of what newline convention was used in the file itself.
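The line-ending behaviour can be confirmed in isolation. The sketch below (the function name first_line_text_mode is hypothetical) writes raw bytes with Windows-style line endings and reads the first line back in text mode, which is how the file is opened according to the paragraph above.

```python
import os
import tempfile


def first_line_text_mode(raw_bytes):
    """Write raw bytes to a temp file and return its first line read in text mode."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(raw_bytes)
        # Default text mode (newline=None) translates \r\n and \r to \n on read.
        with open(path, encoding="ascii") as f:
            return f.readline()
    finally:
        os.remove(path)


windows_style = b"# Netscape HTTP Cookie File\r\n# second line\r\n"
print(repr(first_line_text_mode(windows_style)))
```

The printed representation ends in \n rather than \r\n, so converting the file to Unix line endings should make no difference to the header check.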

What I'd do in this case is triple-check that your file's first line is really correct. It needs to contain either "# HTTP Cookie File" or "# Netscape HTTP Cookie File" (single spaces between the words, no tabs, and matching capitalisation). Test this at the Python prompt:

>>> f = open('cookies.txt')
>>> line = f.readline()
>>> line
'# Netscape HTTP Cookie File\n'
>>> import re
>>> re.search("#( Netscape)? HTTP Cookie File", line)
<_sre.SRE_Match object at 0x10fecfdc8>

Python echoed the line representation back to me when I typed line at the prompt, including the \n newline character. Any surprises like tab characters or unicode zero-width spaces will show up there as escape codes. I also verified that the regular expression used by the cookiejar code matches.
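The same inspection can be wrapped in a small function so it is easy to run against any file. This is a sketch under assumptions: inspect_first_line is a hypothetical name, MAGIC_RE copies the pattern quoted in this answer (the standard library's own check may differ between Python versions), and the file is read as UTF-8 so that characters such as a zero-width space show up as escapes.

```python
import os
import re
import tempfile

# Pattern as quoted in the answer above; an assumption about your Python version.
MAGIC_RE = re.compile(r"#( Netscape)? HTTP Cookie File")


def inspect_first_line(path):
    """Return (escaped first line, whether it matches the magic header pattern)."""
    with open(path, encoding="utf-8") as f:
        first = f.readline()
    # ascii() escapes every non-ASCII character, e.g. '\u200b' for a zero-width space.
    return ascii(first), MAGIC_RE.search(first) is not None


def write_temp(text):
    """Hypothetical helper: write text to a temp file and return its path."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    with os.fdopen(fd, "w", encoding="utf-8", newline="\n") as f:
        f.write(text)
    return path


good = write_temp("# Netscape HTTP Cookie File\n")
tabbed = write_temp("# Netscape\tHTTP Cookie File\n")
hidden = write_temp("# Netscape\u200b HTTP Cookie File\n")
for path in (good, tabbed, hidden):
    print(inspect_first_line(path))
    os.remove(path)
```

Only the first file should report a match; the other two look almost identical in an editor, but the escaped output exposes the tab and the zero-width space.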

You can also use pdb, the Python debugger, to verify what the http.cookiejar module really does:

>>> import pdb
>>> import http.cookiejar
>>> jar = http.cookiejar.MozillaCookieJar('cookies.txt')
>>> pdb.run('jar.load()')
> <string>(1)<module>()
(Pdb) s
--Call--
> /opt/local/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/http/cookiejar.py(1759)load()
-> def load(self, filename=None, ignore_discard=False, ignore_expires=False):
(Pdb) s
> /opt/local/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/http/cookiejar.py(1761)load()
-> if filename is None:
(Pdb) s
> /opt/local/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/http/cookiejar.py(1762)load()
-> if self.filename is not None: filename = self.filename
(Pdb) s
> /opt/local/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/http/cookiejar.py(1765)load()
-> f = open(filename)
(Pdb) n
> /opt/local/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/http/cookiejar.py(1766)load()
-> try:
(Pdb) 
> /opt/local/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/http/cookiejar.py(1767)load()
-> self._really_load(f, filename, ignore_discard, ignore_expires)
(Pdb) s
--Call--
> /opt/local/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/http/cookiejar.py(1989)_really_load()
-> def _really_load(self, f, filename, ignore_discard, ignore_expires):
(Pdb) s
> /opt/local/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/http/cookiejar.py(1990)_really_load()
-> now = time.time()
(Pdb) n
> /opt/local/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/http/cookiejar.py(1992)_really_load()
-> magic = f.readline()
(Pdb) 
> /opt/local/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/http/cookiejar.py(1993)_really_load()
-> if not self.magic_re.search(magic):
(Pdb) 
> /opt/local/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/http/cookiejar.py(1999)_really_load()
-> try:

In the above sample pdb session I used a combination of the step and next commands to verify that the regular expression test (self.magic_re.search(magic)) actually passed.
