Why does Python say this Netscape cookie file isn't valid?
I'm writing a Google Scholar parser, and based on this answer, I'm setting cookies before grabbing the HTML. This is the contents of my cookies.txt file:
# Netscape HTTP Cookie File
# http://curlm.haxx.se/rfc/cookie_spec.html
# This file was generated by libcurl! Edit at your own risk.
.scholar.google.com TRUE / FALSE 2147483647 GSP ID=353e8f974d766dcd:CF=2
.google.com TRUE / FALSE 1317124758 PREF ID=353e8f974d766dcd:TM=1254052758:LM=1254052758:S=_biVh02e4scrJT1H
.scholar.google.co.uk TRUE / FALSE 2147483647 GSP ID=f3f18b3b5a7c2647:CF=2
.google.co.uk TRUE / FALSE 1317125123 PREF ID=f3f18b3b5a7c2647:TM=1254053123:LM=1254053123:S=UqjRcTObh7_sARkN
and this is the code I'm using to grab the HTML:
import http.cookiejar
import urllib.request, urllib.parse, urllib.error

def get_page(url, headers="", params=""):
    filename = "cookies.txt"
    request = urllib.request.Request(url, None, headers, params)
    cookies = http.cookiejar.MozillaCookieJar(filename, None, None)
    cookies.load()
    cookie_handler = urllib.request.HTTPCookieProcessor(cookies)
    redirect_handler = urllib.request.HTTPRedirectHandler()
    opener = urllib.request.build_opener(redirect_handler,cookie_handler)
    response = opener.open(request)
    return response


start = 0
search = "Ricardo Altamirano"
results_per_fetch = 20
host = "http://scholar.google.com"
base_url = "/scholar"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; U; ru; rv:5.0.1.6) Gecko/20110501 Firefox/5.0.1 Firefox/5.0.1'}
params = urllib.parse.urlencode({'start' : start,
                                 'q': '"' + search + '"',
                                 'btnG' : "",
                                 'hl' : 'en',
                                 'num': results_per_fetch,
                                 'as_sdt' : '1,14'})
url = base_url + "?" + params
resp = get_page(host + url, headers, params)
The full traceback is:
Traceback (most recent call last):
  File "C:/Users/ricardo/Desktop/Google-Scholar/BibTex/test.py", line 29, in <module>
    resp = get_page(host + url, headers, params)
  File "C:/Users/ricardo/Desktop/Google-Scholar/BibTex/test.py", line 8, in get_page
    cookies.load()
  File "C:\Python32\lib\http\cookiejar.py", line 1767, in load
    self._really_load(f, filename, ignore_discard, ignore_expires)
  File "C:\Python32\lib\http\cookiejar.py", line 1997, in _really_load
    filename)
http.cookiejar.LoadError: 'cookies.txt' does not look like a Netscape format cookies file
I've looked around for documentation on the Netscape cookie file format, but I can't find anything that shows me the problem. Are there newlines that need to be included? I changed the line endings to Unix style, just in case, but that didn't solve the problem. The closest specification I can find is this, which doesn't indicate anything to me that I'm missing. The fields on each of the last four lines are separated by tabs, not spaces, and everything else looks correct to me.
I see nothing in your example code or copy of the cookies.txt file that is obviously wrong.
I've checked the source code for the MozillaCookieJar._really_load method, which throws the exception that you see.
The first thing this method does is read the first line of the file you specified (using f.readline()) and use re.search to look for the regular expression pattern "#( Netscape)? HTTP Cookie File". This is what fails for your file.
It certainly looks like your cookies.txt would match that format, so the error you see is quite surprising.
Note that your file is opened with a simple open(filename) call earlier on, so it'll be opened in text mode with universal line ending support, meaning it doesn't matter that you are running this on Windows. The code will see \n newline terminated strings, regardless of what newline convention was used in the file itself.
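The newline behaviour described above can be checked in isolation. The following is a minimal sketch, assuming Python 3 and a throwaway temporary file; the helper name first_line_as_text is made up for this illustration and is not part of http.cookiejar.

```python
import os
import tempfile


def first_line_as_text(raw_bytes):
    """Write raw_bytes to a temporary file, reopen it in text mode, return line one."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(raw_bytes)
        # Default text mode (newline=None) translates \r\n and \r to \n on read.
        with open(path) as handle:
            return handle.readline()
    finally:
        os.remove(path)


header = b"# Netscape HTTP Cookie File"
unix_line = first_line_as_text(header + b"\n")
windows_line = first_line_as_text(header + b"\r\n")

# Both variants should come back with a plain \n terminator.
print(repr(unix_line))
print(repr(windows_line))
```

If both printed values are identical, the line-ending convention of the file on disk can be ruled out as the cause.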
What I'd do in this case is triple-check that your file's first line is really correct. It needs to contain either "# HTTP Cookie File" or "# Netscape HTTP Cookie File" (spaces only, no tabs, between the words, capitalisation matching). Test this with the Python prompt:
>>> f = open('cookies.txt')
>>> line = f.readline()
>>> line
'# Netscape HTTP Cookie File\n'
>>> import re
>>> re.search("#( Netscape)? HTTP Cookie File", line)
<_sre.SRE_Match object at 0x10fecfdc8>
Python echoed the line representation back to me when I typed line at the prompt, including the \n newline character. Any surprises like tab characters or unicode zero-width spaces will show up there as escape codes. I also verified that the regular expression used by the cookiejar code matches.
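The same inspection can be scripted so that stray characters are listed with their positions. This is only a sketch: MAGIC_PATTERN is assumed to mirror the pattern quoted above, and describe_header is a hypothetical helper, not something http.cookiejar provides.

```python
import re

# Assumed copy of the pattern quoted above; the real one lives in http.cookiejar.
MAGIC_PATTERN = re.compile(r"#( Netscape)? HTTP Cookie File")


def describe_header(line):
    """Return (matches, suspicious) for the first line of a cookie file.

    suspicious lists (index, repr) pairs for every character other than
    the newline that falls outside printable ASCII.
    """
    matches = MAGIC_PATTERN.search(line) is not None
    suspicious = [
        (index, repr(char))
        for index, char in enumerate(line)
        if char != "\n" and not (" " <= char <= "~")
    ]
    return matches, suspicious


# A clean header, one with a tab, and one with a zero-width space.
print(describe_header("# Netscape HTTP Cookie File\n"))
print(describe_header("#\tNetscape HTTP Cookie File\n"))
print(describe_header("# Netscape\u200b HTTP Cookie File\n"))
```

To apply it to a real file, pass in the result of open('cookies.txt').readline(). A non-empty suspicious list points at the exact character that keeps the header from matching.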
You can also use the pdb Python debugger to verify what the http.cookiejar module really does:
>>> import pdb
>>> import http.cookiejar
>>> jar = http.cookiejar.MozillaCookieJar('cookies.txt')
>>> pdb.run('jar.load()')
> <string>(1)<module>()
(Pdb) s
--Call--
> /opt/local/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/http/cookiejar.py(1759)load()
-> def load(self, filename=None, ignore_discard=False, ignore_expires=False):
(Pdb) s
> /opt/local/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/http/cookiejar.py(1761)load()
-> if filename is None:
(Pdb) s
> /opt/local/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/http/cookiejar.py(1762)load()
-> if self.filename is not None: filename = self.filename
(Pdb) s
> /opt/local/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/http/cookiejar.py(1765)load()
-> f = open(filename)
(Pdb) n
> /opt/local/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/http/cookiejar.py(1766)load()
-> try:
(Pdb)
> /opt/local/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/http/cookiejar.py(1767)load()
-> self._really_load(f, filename, ignore_discard, ignore_expires)
(Pdb) s
--Call--
> /opt/local/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/http/cookiejar.py(1989)_really_load()
-> def _really_load(self, f, filename, ignore_discard, ignore_expires):
(Pdb) s
> /opt/local/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/http/cookiejar.py(1990)_really_load()
-> now = time.time()
(Pdb) n
> /opt/local/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/http/cookiejar.py(1992)_really_load()
-> magic = f.readline()
(Pdb)
> /opt/local/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/http/cookiejar.py(1993)_really_load()
-> if not self.magic_re.search(magic):
(Pdb)
> /opt/local/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/http/cookiejar.py(1999)_really_load()
-> try:
In the above sample pdb session I used a combination of the step and next commands to verify that the regular expression test (self.magic_re.search(magic)) actually passed.
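Since the question states that the cookie lines are tab-separated, that claim can be verified too. The sketch below assumes the usual Netscape layout of seven tab-separated fields per cookie line (domain, subdomain flag, path, secure flag, expiry, name, value); find_malformed_lines is a hypothetical helper. This check is separate from the header test and would not by itself explain the error shown in the question.

```python
def find_malformed_lines(lines):
    """Return (line_number, field_count) for data lines without 7 tab-separated fields."""
    problems = []
    for number, line in enumerate(lines, start=1):
        stripped = line.rstrip("\r\n")
        # Skip blank lines and comments, which includes the header line.
        if not stripped.strip() or stripped.lstrip().startswith("#"):
            continue
        fields = stripped.split("\t")
        if len(fields) != 7:
            problems.append((number, len(fields)))
    return problems


sample = [
    "# Netscape HTTP Cookie File\n",
    ".scholar.google.com\tTRUE\t/\tFALSE\t2147483647\tGSP\tID=353e8f974d766dcd:CF=2\n",
    # Fields separated by spaces instead of tabs: counted as a single field.
    ".google.com TRUE / FALSE 1317124758 PREF ID=353e8f974d766dcd\n",
]
print(find_malformed_lines(sample))
```

For a real file, pass open('cookies.txt').readlines() instead of the sample list. An empty result means every cookie line has the expected number of fields.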