python urllib2 can't get google url
I'm having a really tough time with getting the results page of this url with python's urllib2:
http://www.google.com/search?tbs=sbi:AMhZZitAaz7goe6AsfVSmFw1sbwsmX0uIjeVnzKHjEXMck70H3j32Q-6FApxrhxdSyMo0OedyWkxk3-qYbyf0q1OqNspjLu8DlyNnWVbNjiKGo87QUjQHf2_1idZ1q_1vvm5gzOCMpChYiKsKYdMywOLjJzqmzYoJNOU2UsTs_1zZGWjU-LsjdFXt_1D5bDkuyRK0YbsaLVcx4eEk_1KMkcJpWlfFEfPMutxTLGf1zxD-9DFZDzNOODs0oj2j_1KG8FRCaMFnTzAfTdl7JfgaDf_1t5Vti8FnbeG9i7qt9wF6P-QK9mdvC15hZ5UR29eQdYbcD1e4woaOQCmg8Q1VLVPf4-kf8dAI7p3jM_1MkBBwaxdt_1TsM4FLwh0oHAYKOS5qBRI28Vs0aw5_1C5-WR4dC902Eqm5eAkLiQyAM9J2bioR66g3tMWe-j9Hyh1ID40R1NyXEJDHcGxp7xOn_16XxfW_1Cq5ArdSNzxFvABb1UcXCn5s4_1LpXZxhZbauwaO8cg3CKGLUvl_1wySDB7QIkMIF2ZInEPS4K-eyErVKqOdY9caYUD8X7oOf6sDKFjT7pNHwlkXiuYbKBRYjlvRHPlcPN1WHWCJWdSNyXdZhwDI3VRaKwmi4YNvkryeNMMbhGytfvlNaaelKcOzWbvzCtSNaP2lJziN1x3btcIAplPcoZxEpb0cDlQwId3A5FDhczxpVbdRnOB-Xeq_1AiUTt_1iI6bSgUAinWXQFYWveTOttdSNCgK-VTxV4OCtlrCrZerk27RBLAzT0ol9NOfYmYhiabzhUczWk4NuiVhKN-M4eo76cAsi74PY4V_1lWjvOpI35V_1YLJQrm0fxVcD34wxFYCIllT2gYW09fj3cuBDMNbsaJqPVQ04OOGlwmcmJeAnK96xd_1aMUd6FsVLOSDS7RfS5MNUSyd1jnXvRU_1MF_1Dj8oC8sm7PfVdjm3firiMcaKM28j9kGWbY0heIGLtO_1m6ad-iKfxYEzSux2b5w62LQlP57yS7vX8RFoyKzHA0RrFIEbPBQdNMA3Vpw0G_1LvEjCAPSCV1HH1pDp0l4EnNCvUIAppVXzNMyWT_1gKITj1NLqAn-Z1tH323JwZSc77OftDSreyHJ-BPxn3n7JMkNZFcQx6S7tfBxeqJ1NuDlpax11pw0_1Oi_1nF3vyEP0NbGKSVgNvBv_1tv8ahxvrHn9UnP78FleiOpzUBfdfRPZiT20VEq5-oXtV_1XwIzrd-5_15-cf2yoL7ohyPuv3WKGUGr4YCsYje7_1D8VslqMPsvbwMg9haj3TrBKH7go70ZfPjUv3h1K7lplnnCdV0hrYVQkSLUY1eEor3L--Vu5PlewS60ZH5YEn4qTnDxniV95h8q0Y3RWXJ6gIXitR5y6CofVg
I use the following headers, and this should be simple I would think:
headers = {'Host':'www.google.com','User-Agent':user_agent,'Accept-Language':'en-us,en;q=0.5','Accept-Encoding':'gzip, deflate','Accept-Charset':'ISO-8859-1,utf-8;q=0.7,*;q=0.7','Connection':'keep-alive','Referer':'http://www.google.co.in/imghp?hl=en&tab=ii','Cookie':'PREF=ID=1d7bc4ff2a5d8bc6:U=1d37ba5a518b9be1:FF=4:LD=en:TM=1300950025:LM=1302071720:S=rkk0IbbhxUIgpTyA; NID=51=uNq6mZ385WlV1UTfXsiWkSgnsa6PdjH4l9ph-vSQRszBHRcKW3VRJclZLd2XUEdZtxiCtl5hpbJiS3SpEV7670w_x738h75akcO6Viw47MUlpCZfy4KZ2vLT4tcleeiW; SID=DQAAAMEAAACoYm-3B2aiLKf0cRU8spJuiNjiXEQRyxsUZqKf8UXZXS55movrnTmfEcM6FYn-gALmyMPNRIwLDBojINzkv8doX69rUQ9-'}
When I do the following, I get a result that doesn't contain what any ordinary web browser returns:
request=urllib2.Request(url,None,headers)
response=urllib2.urlopen(request)
html=response.read()
Similarly, this bit of code returns a bunch of hex junk I can't read:
request=urllib2.Request(url,headers=headers)
response=urllib2.urlopen(request)
html=response.read()
Please help, as I am quite sure this is simple enough, and I must just be missing something. I was able to get this link in a similar way, but also uploading an image to images.google.com using the following code:
import httplib, mimetypes, android, sys, urllib2, urllib, simplejson

def post_multipart(host, selector, fields, files):
    """
    Post fields and files to an http host as multipart/form-data.
    fields is a sequence of (name, value) elements for regular form fields.
    files is a sequence of (name, filename, value) elements for data to be uploaded as files
    Return the server's response page.
    """
    content_type, body = encode_multipart_formdata(fields, files)
    h = httplib.HTTP(host)
    h.putrequest('POST', selector)
    h.putheader('content-type', content_type)
    h.putheader('content-length', str(len(body)))
    h.endheaders()
    h.send(body)
    errcode, errmsg, headers = h.getreply()
    return h.file.read()

def encode_multipart_formdata(fields, files):
    """
    fields is a sequence of (name, value) elements for regular form fields.
    files is a sequence of (name, filename, value) elements for data to be uploaded as files
    Return (content_type, body) ready for httplib.HTTP instance
    """
    BOUNDARY = '----------ThIs_Is_tHe_bouNdaRY_$'
    CRLF = '\r\n'
    L = []
    for (key, value) in fields:
        L.append('--' + BOUNDARY)
        L.append('Content-Disposition: form-data; name="%s"' % key)
        L.append('')
        L.append(value)
    for (key, filename, value) in files:
        L.append('--' + BOUNDARY)
        L.append('Content-Disposition: form-data; name="%s"; filename="%s"' % (key, filename))
        L.append('Content-Type: %s' % get_content_type(filename))
        L.append('')
        L.append(value)
    L.append('--' + BOUNDARY + '--')
    L.append('')
    body = CRLF.join(L)
    content_type = 'multipart/form-data; boundary=%s' % BOUNDARY
    return content_type, body

def get_content_type(filename):
    return mimetypes.guess_type(filename)[0] or 'application/octet-stream'

host = 'www.google.co.in'
selector = '/searchbyimage/upload'
fields = [('user-agent','Mozilla/5.0 (Windows NT 5.1; rv:6.0.2) Gecko/20100101 Firefox/6.0.2'),('connection','keep-alive'),('referer','')]
with open('jpeg.jpg', 'rb') as jpeg:
    files = [('encoded_image', 'jpeg.jpg', jpeg.read())]
response = post_multipart(host, selector, fields, files) #added: response =
responseLen=(len(response)-1)
x=22
if response[(x-21):(x+1)]!='EF=\"http://www.google':
    x+=1
x+=145
link=''
while response[(x+1):(x+7)]!='amp;us': #>here<
    link=link+response[x]
    x+=1
print(link)
The code above did not return the page a browser would show. Instead it returned HTML with a "link that has moved", which is the URL I posted at the top of this message. If I can upload my image and get a response back, why can't I get the HTML page behind the resulting link? It's severely frustrating :(
Please help, I've been burning out my brain for over a month on this problem. Yes, I am a newbie, but I thought this would be straightforward :(
Please help me to return the results page of this one little url:
http://www.google.com/search?tbs=sbi:AMhZZitAaz7goe6AsfVSmFw1sbwsmX0uIjeVnzKHjEXMck70H3j32Q-6FApxrhxdSyMo0OedyWkxk3-qYbyf0q1OqNspjLu8DlyNnWVbNjiKGo87QUjQHf2_1idZ1q_1vvm5gzOCMpChYiKsKYdMywOLjJzqmzYoJNOU2UsTs_1zZGWjU-LsjdFXt_1D5bDkuyRK0YbsaLVcx4eEk_1KMkcJpWlfFEfPMutxTLGf1zxD-9DFZDzNOODs0oj2j_1KG8FRCaMFnTzAfTdl7JfgaDf_1t5Vti8FnbeG9i7qt9wF6P-QK9mdvC15hZ5UR29eQdYbcD1e4woaOQCmg8Q1VLVPf4-kf8dAI7p3jM_1MkBBwaxdt_1TsM4FLwh0oHAYKOS5qBRI28Vs0aw5_1C5-WR4dC902Eqm5eAkLiQyAM9J2bioR66g3tMWe-j9Hyh1ID40R1NyXEJDHcGxp7xOn_16XxfW_1Cq5ArdSNzxFvABb1UcXCn5s4_1LpXZxhZbauwaO8cg3CKGLUvl_1wySDB7QIkMIF2ZInEPS4K-eyErVKqOdY9caYUD8X7oOf6sDKFjT7pNHwlkXiuYbKBRYjlvRHPlcPN1WHWCJWdSNyXdZhwDI3VRaKwmi4YNvkryeNMMbhGytfvlNaaelKcOzWbvzCtSNaP2lJziN1x3btcIAplPcoZxEpb0cDlQwId3A5FDhczxpVbdRnOB-Xeq_1AiUTt_1iI6bSgUAinWXQFYWveTOttdSNCgK-VTxV4OCtlrCrZerk27RBLAzT0ol9NOfYmYhiabzhUczWk4NuiVhKN-M4eo76cAsi74PY4V_1lWjvOpI35V_1YLJQrm0fxVcD34wxFYCIllT2gYW09fj3cuBDMNbsaJqPVQ04OOGlwmcmJeAnK96xd_1aMUd6FsVLOSDS7RfS5MNUSyd1jnXvRU_1MF_1Dj8oC8sm7PfVdjm3firiMcaKM28j9kGWbY0heIGLtO_1m6ad-iKfxYEzSux2b5w62LQlP57yS7vX8RFoyKzHA0RrFIEbPBQdNMA3Vpw0G_1LvEjCAPSCV1HH1pDp0l4EnNCvUIAppVXzNMyWT_1gKITj1NLqAn-Z1tH323JwZSc77OftDSreyHJ-BPxn3n7JMkNZFcQx6S7tfBxeqJ1NuDlpax11pw0_1Oi_1nF3vyEP0NbGKSVgNvBv_1tv8ahxvrHn9UnP78FleiOpzUBfdfRPZiT20VEq5-oXtV_1XwIzrd-5_15-cf2yoL7ohyPuv3WKGUGr4YCsYje7_1D8VslqMPsvbwMg9haj3TrBKH7go70ZfPjUv3h1K7lplnnCdV0hrYVQkSLUY1eEor3L--Vu5PlewS60ZH5YEn4qTnDxniV95h8q0Y3RWXJ6gIXitR5y6CofVg
Dave
Your user-agent is not defined!
Try this one:
#!/usr/bin/python
import urllib2
url = "http://www.google.com/search?q=mysearch"
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
print opener.open(url).read()
raw_input()
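For readers on Python 3, where `urllib2` was folded into `urllib.request`, a rough equivalent of the snippet above might look like the following. This is only a sketch: the helper name `make_opener` is hypothetical, and the actual request is left commented out so the block runs without network access.

```python
import urllib.request


def make_opener(user_agent="Mozilla/5.0"):
    """Return an opener that sends the given User-Agent with every request.

    make_opener is a hypothetical helper name used only for this sketch.
    """
    opener = urllib.request.build_opener()
    # Assigning addheaders replaces the default "Python-urllib/x.y" header.
    opener.addheaders = [("User-agent", user_agent)]
    return opener


opener = make_opener()
# Uncomment to perform the real request (needs network access):
# html = opener.open("http://www.google.com/search?q=mysearch").read()
```

The point is the same as in the Python 2 version: set the header on the opener so that every request made through it carries a browser-like user-agent.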
If you would like to find another user-agent, you can open about:config in Firefox and search for "user-agent". For example:
Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.8) Gecko/20050511
Googlebot/2.1 (+http://www.google.com/bot.html)
Opera/7.23 (Windows 98; U) [en]
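As for the "hex junk" mentioned in the question: the headers there ask for `Accept-Encoding: gzip, deflate`, and urllib2 does not decompress response bodies on its own, so the unreadable bytes are plausibly just the compressed page. Dropping that header is the simplest fix. Otherwise, a minimal sketch of decoding the body by hand (written for Python 3; `decode_body` is a hypothetical helper name) could be:

```python
import gzip
import zlib


def decode_body(raw, content_encoding):
    """Decompress a response body based on its Content-Encoding value.

    decode_body is a hypothetical helper, not part of urllib.
    """
    encoding = (content_encoding or "").lower()
    if encoding == "gzip":
        return gzip.decompress(raw)
    if encoding == "deflate":
        try:
            # Standard case: zlib-wrapped deflate stream.
            return zlib.decompress(raw)
        except zlib.error:
            # Some servers send a raw deflate stream without the zlib header.
            return zlib.decompress(raw, -zlib.MAX_WBITS)
    # No (or unknown) encoding: return the bytes unchanged.
    return raw


# Local demonstration with a hand-compressed sample, no network needed.
sample = gzip.compress(b"<html>hello</html>")
print(decode_body(sample, "gzip"))
```

With a real response object, the encoding would come from the response headers, for example `response.headers.get("Content-Encoding")` in Python 3.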